This is a work in progress. It will be completed while studying for the exam.
Part 1: Statistical characterization of audiovisual information
We want to digitize our signals for several reasons. Digital signals are more robust to noise introduced by storage and transmission. Digitization allows for error protection and makes encryption possible. It lets us integrate speech and images so that the data is independent of the network. It makes packet transmission over the internet possible, and introduces the possibility of integration between transmission and switching. We can use digital signal processing on the digitized information, and it fits the development of modern digital circuits.
1.1.1: Digital Speech
- Bandwidth: 50-3400 Hz
- Sampling: 8 kHz
- Representation: 8 bit Log PCM (13 bit linear PCM)
- Dynamic range: 42 dB
- Spectral components
- Component description (phonemes)
Digital speech is often presented as a spectrum (frequency/amplitude) or a spectrogram (frequency/time/amplitude).
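The parameters in the list above give a bit rate of 8 kHz x 8 bit = 64 kbit/s. The following is a minimal sketch of why logarithmic PCM gets away with 8 bits. It assumes the continuous mu-law curve with mu = 255 as a stand-in for the log PCM mentioned above (real telephony codecs use piecewise-linear A-law or mu-law tables), and the function names are made up for the illustration.

```python
import math

MU = 255  # assumption: mu-law constant; A-law would use a different curve

def compress(x):
    """Logarithmically compress a sample in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def expand(y):
    """Inverse of compress()."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def log_pcm(x, levels=127):
    """Compress, quantize to 8 bits (sign + 7 bits), then decode again."""
    code = round(compress(x) * levels)
    return expand(code / levels)

def linear_pcm(x, levels=127):
    """Uniform 8-bit quantization, for comparison."""
    return round(x * levels) / levels

BITRATE = 8000 * 8  # 8 kHz sampling x 8 bits per sample, in bit/s

quiet = 0.01  # a quiet sample, 1% of full scale
log_error = abs(log_pcm(quiet) - quiet)
linear_error = abs(linear_pcm(quiet) - quiet)
print(BITRATE, log_error, linear_error)
```

The quiet sample is reproduced far more accurately by the logarithmic quantizer than by the uniform one, which is the reason 8 bit log PCM can replace a longer linear word.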
1.1.2: Digital images/video
- Many formats, several of them proprietary: .flv, .avi, .mp4 etc.
- Rates go from ~20 kbps to 27 Mbps
Images have significant correlation between spectral components. The exact choice of color representation therefore matters little, and conversion between different color spaces is fairly easy. There is also significant correlation between neighbouring pixels. This means that we do not have to treat every pixel by itself, but can instead look at a larger area of pixels. Natural images often exhibit large semi-uniform areas. Images are also non-stationary.
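To illustrate how simple a color space conversion is, here is a small sketch of RGB to YCbCr and back. It assumes the full-range BT.601 coefficients used in JFIF/JPEG files; other standards use slightly different constants.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range YCbCr (BT.601 weights)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def ycbcr_to_rgb(y, cb, cr):
    """Inverse conversion back to RGB."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return r, g, b

gray = rgb_to_ycbcr(128, 128, 128)       # a gray pixel has neutral chroma
roundtrip = ycbcr_to_rgb(*rgb_to_ycbcr(200, 30, 90))
print(gray, roundtrip)
```

Most of the signal energy ends up in the luminance component Y, which is why the two chroma components can be coded more coarsely.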
We have different tools for analysis and coding/compression of digital images.
- Fourier transform in two dimensions
- Block transforms (DCT)
- Filter banks/Subband decomposition
- General filters
- Motion compensation
- Entropy coding
For a source coder, the output should be efficient (compression), have minimum distortion (fidelity) and be robust (reconstruction). In the above picture we see the tradeoffs when dealing with coding. Time delay, compression ratio and complexity all affect the overall received signal quality.
1.2.1: Signal decomposition
When dealing with audiovisual content there are two ways we can reduce the size of the files. A lot of the signal contains redundancy, for example big uniform areas in an image. Redundancy can be removed reversibly, without loss of information. Some of the content is also irrelevant, for example frequencies above the human hearing range in audio content. Removing irrelevancy is an irreversible operation.
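As a small illustration of reversible redundancy removal, the sketch below run-length codes one row of pixels with a large uniform area. Run-length coding is just one simple example chosen here, and the pixel values are made up.

```python
from itertools import groupby

def rle_encode(pixels):
    """Replace each run of equal values by a (value, run length) pair."""
    return [(value, len(list(run))) for value, run in groupby(pixels)]

def rle_decode(pairs):
    """Expand the (value, run length) pairs back into the original samples."""
    return [value for value, count in pairs for _ in range(count)]

row = [255] * 12 + [0] * 3 + [255] * 5   # hypothetical image row, 20 pixels
encoded = rle_encode(row)
decoded = rle_decode(encoded)
print(encoded)
```

The 20 pixels shrink to three pairs, and decoding gives back exactly the original row, so no information is lost.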
1.2.3: Rate distortion
There is a fundamental tradeoff between rate and distortion. This is given by the rate-distortion function, which gives the minimum rate needed to represent a source with at most a given distortion.
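One case where the rate-distortion function has a closed form is a memoryless Gaussian source with variance sigma^2 and mean squared error distortion D: R(D) = max(0, 0.5 * log2(sigma^2 / D)) bits per sample. The sketch below evaluates it; the Gaussian source is an assumption made for the example, since real audiovisual sources are more complicated.

```python
import math

def gaussian_rate(variance, distortion):
    """R(D) in bits per sample for a memoryless Gaussian source with MSE distortion."""
    if distortion >= variance:
        return 0.0  # sending nothing already meets the distortion target
    return 0.5 * math.log2(variance / distortion)

one_bit = gaussian_rate(1.0, 0.25)
two_bits = gaussian_rate(1.0, 0.0625)
print(one_bit, two_bits)
```

Each extra bit per sample divides the achievable distortion by four, which corresponds to roughly 6 dB.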
Part 2: Standards
2.1: Audio Compression
2.2: Image Compression
Image compression can be either lossless or lossy. Lossless means that we can compress an image, decompress it, and get exactly the same image back. With lossy compression, the image is altered in such a way that the original cannot be recovered. We are mostly interested in lossy compression in this course, so the lossless formats will only be presented briefly.
A picture is usually divided into slices, then macroblocks, then 8x8 pixel blocks.
The spectral redundancy can be removed by doing a color transform, for example YUV or grayscale conversion. The spatial redundancy can be reduced with transform coding/subband decomposition and quantization. The temporal redundancy can be reduced by using motion compensation/estimation, predictive coding and temporal filtering. Entropy coding is a "free" way to compress the content further, since it is lossless. This can be achieved by using Huffman coding or arithmetic coding.
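The entropy coding step can be sketched with a small Huffman code builder. This is a minimal illustration on a made-up symbol sequence; it only computes the code length of each symbol and is not the table format any particular standard uses.

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code of the input."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (weight, tie breaker, {symbol: depth in the subtree}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes all their symbols one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

data = "aaaaaaaabbbbccdd"                 # hypothetical quantized symbols
lengths = huffman_code_lengths(data)
total_bits = sum(lengths[s] for s in data)
fixed_bits = 2 * len(data)                # 2 bits/symbol for a 4-symbol alphabet
print(lengths, total_bits, fixed_bits)
```

Frequent symbols get short codewords, so the sequence needs 28 bits instead of the 32 bits a fixed-length code would use, without any loss.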
2.2.1: Lossless formats
TIFF stands for "Tagged Image File Format". It can handle color depths from 1 to 24 bits and uses Lempel-Ziv-Welch lossless compression. TIFF is a flexible, adaptable file format for handling images and data within a single file by including header tags (size, definition, image-data arrangement, applied image compression). It can be a container holding both lossy and lossless compressed images.
BMP is short for "Bitmap". It was developed by Microsoft for Windows. It stores two-dimensional digital images, both monochrome and color, in various color depths, and optionally with data compression, alpha channels and color profiles.
GIF stands for "Graphics Interchange Format". It can only contain 256 colors. It is commonly used for images on the web and for sprites in software programs.
PNG stands for "Portable Network Graphics". It supports 24-bit color. It also supports an alpha channel, so that an image can have 256 levels of transparency. This allows designers to fade an image to a transparent background rather than to a specific color.
2.2.2: Lossy formats
JPEG is waveform based, and uses the DCT. The process of JPEG encoding is as follows. First we do a color transform from RGB to YCbCr to reduce spectral redundancy. We segment the image into 8x8 blocks, and do a DCT of each block. The DCT coefficients of each block are quantized with quantization tables (the standard suggests default tables, which encoders typically scale with a quality setting). We then scan the coefficients in zigzag order, run-length code the runs of zeros, and finally entropy code the result with Huffman or arithmetic coding.
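The block-level steps can be sketched as follows: level shift, 8x8 DCT, quantization and zigzag scan. This is a plain, slow illustration and not an actual JPEG encoder. The table is the example luminance table from Annex K of the JPEG standard, and the function names are made up here.

```python
import math

N = 8
LUMA_Q = [  # example luminance quantization table (JPEG standard, Annex K)
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]

def dct2(block):
    """Direct 2-D DCT-II of an N x N block with orthonormal scaling."""
    def scale(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = scale(u) * scale(v) * total
    return out

def zigzag_order(n=N):
    """(row, col) positions in zigzag order, from low to high frequency."""
    positions = [(row, col) for row in range(n) for col in range(n)]
    # Walk the anti-diagonals, alternating direction on each one.
    return sorted(positions,
                  key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else p[1]))

def encode_block(block):
    """Level shift, transform, quantize and zigzag scan one 8x8 block."""
    shifted = [[pixel - 128 for pixel in row] for row in block]
    coeffs = dct2(shifted)
    quantized = [[round(coeffs[u][v] / LUMA_Q[u][v]) for v in range(N)]
                 for u in range(N)]
    return [quantized[row][col] for row, col in zigzag_order()]

flat = [[130] * N for _ in range(N)]   # a uniform block
scanned = encode_block(flat)
print(scanned)
```

For a uniform block only the DC coefficient survives, so the zigzag scan yields one nonzero value followed by 63 zeros, which is exactly the kind of sequence run-length and entropy coding handle well.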
JPEG2000 uses wavelets, and is used mainly for digital cinema. JPEG2000 is meant to complement JPEG, not to replace it. The most interesting additions are region of interest (ROI) coding, better error resilience and more flexible progressive coding. It also allows lossy and lossless compression in one system. It provides better compression at low bit rates. It is better at computer-generated images and graphics, and better under error-prone conditions. It uses the discrete wavelet transform (DWT) instead of the DCT for signal decomposition. This gives JPEG2000 "ringing" artefacts, caused by the removal of high-frequency components, instead of JPEG's blockiness.
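The idea behind wavelet decomposition can be sketched with one level of the Haar transform on a 1-D signal. Note that the Haar wavelet is substituted here because it is the simplest possible case; JPEG2000 itself uses longer 5/3 and 9/7 filters, and the sample values are made up.

```python
def haar_1d(signal):
    """One decomposition level: low-pass (averages) and high-pass (details)."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    averages = [(a + b) / 2 for a, b in pairs]
    details = [(a - b) / 2 for a, b in pairs]
    return averages, details

def inverse_haar_1d(averages, details):
    """Rebuild the signal from the two subbands."""
    signal = []
    for avg, det in zip(averages, details):
        signal += [avg + det, avg - det]
    return signal

row = [10, 10, 12, 12, 50, 52, 20, 20]   # hypothetical pixel row
low, high = haar_1d(row)
rebuilt = inverse_haar_1d(low, high)
print(low, high)
```

The low band is a half-resolution version of the row and the high band is almost all zeros. Applying the same step again to the low band gives the multi-resolution structure that makes resolution scalability natural.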
Images may contain 1-16384 components, and sample pixel values can be either signed or unsigned. It supports bit depths up to 38 bits. The first step is preprocessing of the images. The input images are partitioned into rectangular, non-overlapping tiles of equal size. The tile size can vary from one pixel to the whole image. We get the DC offset by subtracting
JPEG2000 is scalable in the sense that it can code more than one quality and/or resolution simultaneously. The bitstream is scalable as well: different qualities and resolutions can be extracted from the same bitstream.
2.3: Video Compression
In video we have both spatial and temporal redundancy.
MPEG is a family of video and audio coding standards. For signal decomposition, all the standards use DCT + quantization. They use motion compensation as well as lossless entropy coding of motion vectors and quantized DCT samples. The encoder can trade off quality against bitrate. The default quantization matrix can be multiplied by a quantization parameter in order to obtain different quantization steps. Small steps give high quality and high bitrate, while large quantization steps give low quality and low bitrate. The resulting bitrate cannot be known accurately from the quantization parameter. This is because the bitrate depends on the spatial and temporal frame activity and on the GOP structure. In practice, constant bitrate and constant quality cannot be obtained at the same time.
MPEG-1 consists of five parts: systems, video, audio, conformance testing and reference software. The goal was interactive movies on CD (1.4 Mbit/s) with VHS-quality video and CD-quality audio. MP3 is part of MPEG-1 audio. MPEG-1 is also used in cable TV and DAB. The standard only specifies the bitstream syntax and the decoder, so you are free to build your own encoder as long as it produces a conforming bitstream.
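The effect of the quantization parameter can be sketched on a handful of coefficients. All numbers below are hypothetical: the coefficients are made up and the base steps are just an example row, not the default matrix of any particular MPEG standard.

```python
def quantize(coeffs, base_steps, qp):
    """Quantize each coefficient with its base step scaled by qp."""
    return [round(c / (step * qp)) for c, step in zip(coeffs, base_steps)]

def dequantize(levels, base_steps, qp):
    """Decoder side: map the integer levels back to coefficient values."""
    return [level * step * qp for level, step in zip(levels, base_steps)]

coeffs = [620.0, -95.0, 48.0, 21.0, -9.0, 4.0]   # hypothetical DCT coefficients
base_steps = [16, 11, 10, 16, 24, 40]            # hypothetical base quantization steps

fine = quantize(coeffs, base_steps, qp=1)
coarse = quantize(coeffs, base_steps, qp=4)
print(fine, coarse)
print(dequantize(coarse, base_steps, qp=4))
```

With the larger parameter more coefficients become zero and the remaining levels are smaller, so fewer bits are needed, but the reconstructed values are further from the originals.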
MPEG-2 evolved from the shortcomings of MPEG-1. It consists of nine parts, where parts 1-5 are the MPEG-1 counterparts. MPEG-2 brings better support for audiovisual data transport and broadcast applications. Different modes for motion compensation are used. It adds support for interlaced video, including a vertical scan mode for this. It also adds 4:2:2 subsampling in addition to 4:2:0. It defines profiles and levels for specific types of application. Profiles define the set of coding tools and thereby the quality of the video, while levels constrain parameters such as the resolution of the video. Frame rates range from 30 to 60 fps and bitrates from 4 to 80 Mbps. It has better scalability in terms of spatial resolution, temporal resolution (fps) and quality (SNR). It also includes Advanced Audio Coding (AAC), digital storage media command and control, and real-time interfaces to transport stream decoders. It is used on DVDs and in digital broadcasting.
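The subsampling schemes can be sketched on a tiny 4x4 frame. The sketch assumes even frame dimensions and simple averaging of the chroma samples (real encoders may filter differently), and the sample values are made up.

```python
def subsample_422(plane):
    """4:2:2: halve the horizontal chroma resolution."""
    return [[(row[c] + row[c + 1]) / 2 for c in range(0, len(row), 2)]
            for row in plane]

def subsample_420(plane):
    """4:2:0: halve both the horizontal and vertical chroma resolution."""
    return [
        [(plane[r][c] + plane[r][c + 1] + plane[r + 1][c] + plane[r + 1][c + 1]) / 4
         for c in range(0, len(plane[0]), 2)]
        for r in range(0, len(plane), 2)
    ]

def sample_count(*planes):
    """Total number of samples in the given planes."""
    return sum(len(p) * len(p[0]) for p in planes)

luma = [[16 * r + c for c in range(4)] for r in range(4)]
cb = [[100, 102, 110, 112],
      [104, 106, 114, 116],
      [90, 90, 80, 80],
      [90, 90, 80, 80]]
cr = [row[:] for row in cb]

full = sample_count(luma, cb, cr)
s422 = sample_count(luma, subsample_422(cb), subsample_422(cr))
s420 = sample_count(luma, subsample_420(cb), subsample_420(cr))
print(full, s422, s420, subsample_420(cb))
```

4:2:2 keeps two thirds of the samples and 4:2:0 keeps half, before any actual compression has been applied.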
MPEG-4 absorbs parts from MPEG-1/2 and has 21 parts. It allows for more content types, like graphics, animation, symbolic music and subtitles, and adds more tools for networking and storage. It features Advanced Video Coding (AVC). It is used for HDTV broadcasting, IP-based TV, gaming, mobile video applications, streaming, video conferencing etc. It is suitable for mobile environments with limited link capacity, thanks to its high compression efficiency and scalability. It is supported in several web applications, but most applications only support the video and audio parts.
MPEG-7 is designed for multimedia content description. It is useful in digital library, search and broadcast selection applications. It makes it possible to do text-based search in audiovisual content in a standardized way. It uses XML to store metadata, which can be attached to timecodes in order to tag particular events, or for example to sync lyrics to a song. It is not widely used.
MPEG-21 is an open framework for delivery and consumption of digital multimedia content, and for digital rights management. It defines an XML-based "Rights Expression Language". It has the concepts of a user and a digital item. The content does not need to be MPEG content. It consists of 19 parts. It is not widely used.
AVC (H.264) was intended to give a 50% rate reduction at fixed fidelity compared to any existing video codec. New features in AVC are improved networking support and error robustness, improved transforms and entropy coding, and improved prediction techniques. It includes the network abstraction layer (NAL), which is designed to provide network friendliness. A NAL unit can contain control data or encoded slices of video. It is encapsulated in an IP/UDP/RTP datagram and sent over the network. NAL gives higher robustness to packet loss by sending redundant frames in a lower quality, or by using different macroblock interleaving patterns. AVC adds two new slice types, SP slices and SI slices. It uses intra prediction, which removes the redundancy between neighbouring macroblocks. It includes variable block-size motion compensation, which for each 16x16 macroblock selects the partition into subblocks that minimizes the coded residual and motion vectors. Thus small changes are predicted best with large blocks, and vice versa. It uses seven different block sizes, which can save more than 15% compared to using only the 16x16 block size. It also uses an in-loop deblocking filter, which removes blocking artefacts. It uses one of two entropy coding methods, CAVLC or CABAC. CABAC outperforms CAVLC by 5-15% in bit rate savings. It introduces different profiles, supporting 8-12 bits/sample and 4:2:0 to 4:4:4 subsampling, and some even support lossless region coding. It is not backwards compatible, and has a high encoder complexity. The transmission rate of AVC is 64 kbps to 150 Mbps.
High-resolution video (UHD) demanded a new codec, HEVC. For UHD we need better compression efficiency and lower computational complexity. The goal of HEVC is to double the coding efficiency. It is designed for parallel processing architectures, for low-power and low-complexity processing. It offers high spatial resolution, high frame rates and high dynamic range. It has an expanded colour gamut and an expanded number of views. HDTV has 1920x1080p with 50/60 fps. UHD-1 has 3840x2160p with 50/60 fps and 120 Hz. 4K has 4096x2160 with 24/48 fps. UHD-2 (8K) has 7680x4320p with 50/60 fps and 120 Hz. HEVC uses a coding tree structure instead of macroblocks, which is more flexible with respect to block size. HEVC supports variable prediction block (PB) sizes from 64x64 to 4x4 samples. It has 35 modes for intra prediction: 33 directional modes, planar (surface fitting) prediction and DC prediction. It has deblocking filters and sample adaptive offset (SAO) filters. It has high complexity, but superior parallelization tools. HEVC outperforms VP8. There have been some problems with getting a patent pool of companies to settle the royalties.
2.3.2: What encoder do we choose?
Important questions when deciding which codec to use:
Technical:
- Compression ratios
- Quality improvements
- 2D, 3D, multiview?
- Complexity (power constraints, delay)
Business:
- Application or service driven
- Specialised codec? All-rounder?
- Standardised? Proprietary? Open source?
We once again return to the pyramid from earlier: high received signal quality cannot be combined with low time delay, low complexity and a high compression ratio all at once.
2.3.3: Video encoding
2.3.3.1: Frame types
For video compression we use I-, P- and B-frames to compress the raw video. Intra frames (I-frames) are original video frames compressed on their own. Predicted frames (P-frames) are predicted from a preceding I- or P-frame. Bidirectionally predicted frames (B-frames) can be seen as an interpolation between I- and P-frames. A Group of Pictures (GOP) is the sequence of frames between two I-frames.
Intra-coding is coding within a frame. An I-frame, for example, is an intra-coded frame. Inter-coding is coding between multiple frames, where we can use the original I-frame and estimate what has moved in the next frame. Often two consecutive video frames are very alike, and it would be wasteful to intra-code every frame.
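A consequence of B-frames is that frames cannot be coded in display order, since a B-frame needs both of its reference frames first. The sketch below reorders a GOP under the assumption that each B-frame is predicted from the nearest I- or P-frame on either side; the GOP pattern is a made-up example.

```python
def coding_order(display):
    """Reorder display-order frame types so every B-frame follows its anchors."""
    ordered, waiting_b = [], []
    for index, kind in enumerate(display):
        if kind == "B":
            waiting_b.append((index, kind))   # needs the next anchor first
        else:
            ordered.append((index, kind))     # I- or P-frame (anchor)
            ordered.extend(waiting_b)
            waiting_b = []
    # Trailing B-frames would have to wait for the first anchor of the next GOP.
    ordered.extend(waiting_b)
    return ordered

order = coding_order("IBBPBBP")
print(order)
```

The frames are transmitted as I0, P3, B1, B2, P6, B4, B5, so the decoder has to buffer and reorder frames, which adds delay.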
2.3.3.2: Motion estimation
Ideally we would want motion information for each pixel in the image, but with a high-resolution image this is way too expensive. We could instead get the motion information for each homogeneous region or object in the image. In practice, we often do motion estimation for each 16x16 macroblock. We investigate all possible positions within a search window. We then compare the original macroblock with each possible position and keep the one with the lowest mean square error. The motion vector gives us the corresponding translation.
With a model for affine motion we have translation, rotation and scaling parameters, 6 in total for 2D. This is a bit complex, and in practice we often use motion vectors, which have two parameters. Backward prediction predicts where the pixels in the current frame were in a past frame. Forward prediction predicts where the pixels in the current frame will go in a future frame. I-frames use no temporal coding, P-frames use forward prediction, and B-frames use both forward and backward prediction.
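The full search described above can be sketched as follows. To keep the example small it uses a 4x4 block and a search range of +/-3 pixels instead of a 16x16 macroblock, and the frames are synthetic; the function names are made up for the illustration.

```python
def block_mse(ref, cur, ref_y, ref_x, cur_y, cur_x, size):
    """Mean square error between a block in cur and a candidate block in ref."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            diff = cur[cur_y + dy][cur_x + dx] - ref[ref_y + dy][ref_x + dx]
            total += diff * diff
    return total / (size * size)

def full_search(ref, cur, cur_y, cur_x, size=16, search=7):
    """Try every displacement in the window; return (lowest MSE, (dy, dx))."""
    height, width = len(ref), len(ref[0])
    best = (float("inf"), (0, 0))
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ref_y, ref_x = cur_y + dy, cur_x + dx
            if 0 <= ref_y <= height - size and 0 <= ref_x <= width - size:
                error = block_mse(ref, cur, ref_y, ref_x, cur_y, cur_x, size)
                if error < best[0]:
                    best = (error, (dy, dx))
    return best

# Synthetic 12x12 reference frame with a non-repeating pattern.
ref = [[(37 * y + 11 * x * x + 5 * x * y) % 251 for x in range(12)] for y in range(12)]
# Current frame: the reference moved 1 pixel down and 2 pixels right.
cur = [[ref[y - 1][x - 2] if y >= 1 and x >= 2 else 0 for x in range(12)]
       for y in range(12)]

error, vector = full_search(ref, cur, 4, 4, size=4, search=3)
print(error, vector)
```

The motion vector here points from the block in the current frame to its best match in the reference frame, so content that moved down and to the right gives a vector pointing up and to the left. The cost grows with the square of the search range, which is why practical encoders use faster search strategies.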
2.3.3.3: The hybrid video encoder
In the figure above we see the hybrid video encoder. For encoding I-frames, we do a DCT + quantization. This compressed frame is then entropy coded and is then ready. The quantized frame is also fed through inverse quantization and an inverse DCT and put in a frame buffer for predicting the P-frames. When encoding P-frames, we get a new frame from the raw video, and compute the residual between this frame and the previous I-frame. We then do a DCT + quantization on the residual followed by entropy coding, and we have one of the two parts of the P-frame. We also find the motion vectors needed to transform the previous I-frame into the current frame. These are calculated, and then entropy coded.
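The role of the frame buffer can be sketched with a heavily simplified encoder loop. Several things are assumed for brevity: frames are short 1-D lists, a plain scalar quantizer stands in for DCT + quantization, there is no motion compensation, and the step size and sample values are made up.

```python
STEP = 8  # hypothetical quantization step

def quantize(values):
    return [round(v / STEP) for v in values]

def dequantize(levels):
    return [level * STEP for level in levels]

def encode_sequence(frames):
    """Code the first frame as an I-frame and the rest as P-frames.

    Returns the quantized data per frame and what the decoder reconstructs.
    """
    coded, reconstructions = [], []
    reference = None
    for frame in frames:
        if reference is None:
            levels = quantize(frame)                  # I-frame: code the frame itself
            recon = dequantize(levels)
        else:
            residual = [f - r for f, r in zip(frame, reference)]
            levels = quantize(residual)               # P-frame: code the residual
            recon = [r + d for r, d in zip(reference, dequantize(levels))]
        coded.append(levels)
        reconstructions.append(recon)
        reference = recon   # the frame buffer holds the decoded frame, not the original
    return coded, reconstructions

frames = [[100, 104, 96, 120], [101, 105, 97, 121], [103, 107, 99, 123]]
coded, reconstructions = encode_sequence(frames)
print(coded)
```

Because the prediction uses the reconstructed frame, encoder and decoder stay in sync and the error in every frame stays within half a quantization step instead of accumulating. The P-frame data is also almost all zeros, which is cheap to entropy code.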
2.4: Quality measures
2.4.1: Objective measures
Image quality can be assessed objectively by SNR or PSNR.
Objective measures have some weaknesses. For example, if we have one corrupted line, the PSNR will still be high, while most people would not see this as a good image. A tilt in the image will result in a low PSNR, while it would not be noticeable to people.
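The corrupted-line example can be checked with a small PSNR sketch. It uses the usual definition PSNR = 10 * log10(peak^2 / MSE) with peak = 255 for 8-bit images; the test image is a synthetic, uniformly gray 256x256 image.

```python
import math

def mse(reference, test):
    """Mean square error between two images given as lists of rows."""
    total, count = 0, 0
    for ref_row, test_row in zip(reference, test):
        for r, t in zip(ref_row, test_row):
            total += (r - t) ** 2
            count += 1
    return total / count

def psnr(reference, test, peak=255):
    """Peak signal-to-noise ratio in dB."""
    error = mse(reference, test)
    if error == 0:
        return float("inf")   # identical images
    return 10 * math.log10(peak ** 2 / error)

SIZE = 256
reference = [[128] * SIZE for _ in range(SIZE)]
corrupted = [row[:] for row in reference]
corrupted[100] = [0] * SIZE              # one completely black line
value = psnr(reference, corrupted)
print(round(value, 2))
```

The result is roughly 30 dB, a value normally associated with decent quality, even though a black line across the image is very visible.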
2.4.2: Subjective measures
Subjective quality is determined based on the judgement of a large number of viewers, after elaborate tests comparing the reference and the modified content. Results are summarized using mean opinion scores (MOS).
One weakness of subjective measurements is that the viewing conditions may not be optimal, so that the viewers focus on external factors rather than on the content being tested.
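Computing a MOS is simple. The sketch below assumes a five-point rating scale (1 = bad, 5 = excellent) and adds a 95% confidence interval based on a normal approximation; the ratings are made up.

```python
import math
import statistics

def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    mos = statistics.mean(ratings)
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width

ratings = [5, 4, 4, 3, 5, 4, 2, 4, 5, 4]   # hypothetical viewer ratings
mos, half_width = mean_opinion_score(ratings)
print(mos, half_width)
```

The confidence interval shrinks with the square root of the number of viewers, which is one reason a large number of viewers is needed.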
Part 3: QoE
Quality is defined by ISO 9000 as "The degree to which a set of inherent characteristics fulfils requirements". ITU-T defines QoE as "The overall acceptability of an application or service, as perceived subjectively by the end-user". Qualinet defines QoE as "... The degree of delight or annoyance of the user of an application or service. It results from the fulfillment of his or her expectations with respect to the utility and/or enjoyment of the application or service in the light of the user's personality and current state." Phew.
The difference between QoE and QoS is shown in the table below.
|| QoS || QoE ||
|| Network-centric || User-centric ||
|| Delay, packet loss, jitter || Content dependent ||
|| Transmission quality || Context dependent ||
|| Content agnostic || Perception ||
QoE sets the user in focus, and as we know, users are not a homogeneous group.
3.2: Universal Multimedia Access (UMA)
UMA addresses the delivery of multimedia resources under different and varying network conditions, diverse terminal equipment capabilities, specific user or creator preferences and needs, and usage environment conditions. UMA aims to guarantee unrestricted access to multimedia content from any device, through any network, independently of the original content format, efficiently and in a way that satisfies user preferences and usage environment conditions. The fulfillment of UMA requires that content is remotely accessible and searchable. Useful descriptions of the content and the context (terminal capabilities) are needed. We need content delivery systems that are able to use this information to provide the intended value to the users independently of location and type of device. One crucial part is the content and context descriptors, which decide whether the content needs to be adapted before it is delivered to the end user. MPEG-7 and MPEG-21 are well suited for implementation of UMA systems.
The sense of being immersed can be described as "being there" when consuming content. The duality in creating an experience is "the sense of realness" or "the sense of being there".