Table of Contents
  1. Part 1: Statistical characterization of audiovisual information
    1. 1.1: Digitization
      1. 1.1.1: Digital Speech
      2. 1.1.2: Digital images/video
    2. 1.2: Compression
      1. 1.2.1: Signal decomposition
      2. 1.2.2: Signal quantisation
      3. 1.2.3: Signal coding
    3. 1.3: Manipulation
    4. 1.4: Communications
    5. 1.5: Filtering
  2. Part 2: Standards
    1. 2.1: Audio Compression
    2. 2.2: Image Compression
      1. 2.2.1: Lossless formats
        1. 2.2.1.1: Tiff
        2. 2.2.1.2: BMP
        3. 2.2.1.3: GIF
        4. 2.2.1.4: PNG
      2. 2.2.2: Lossy formats
        1. 2.2.2.1: JPEG
        2. 2.2.2.2: JPEG2000
    3. 2.3: Video Compression
      1. 2.3.1: MPEG
        1. 2.3.1.1: MPEG1
        2. 2.3.1.2: MPEG2
        3. 2.3.1.3: MPEG4
        4. 2.3.1.4: MPEG-7
        5. 2.3.1.5: MPEG-21
        6. 2.3.1.6: MPEG-I
        7. 2.3.1.7: H.264/AVC
        8. 2.3.1.8: H.265/HEVC
      2. 2.3.2: What encoder do we choose?
      3. 2.3.3: Video encoding
        1. 2.3.3.1: Frame types
        2. 2.3.3.2: Motion estimation
        3. 2.3.3.3: The hybrid video encoder
    4. 2.4: Quality measures
      1. 2.4.1: Objective measures
      2. 2.4.2: Subjective measures
  3. Part 3: QoE
    1. 3.1: QoE
    2. 3.2: Universal Multimedia Access (UMA)
    3. 3.3: Immersion
  4. Part 4: Information theory
    1. 4.1: The Entropy and Data Compression
      1. 4.1.1: Data Compression Theorem
        1. 4.1.1.1: Proof of the Data Compression Theorem (DCT)
      2. 4.1.2: Further properties of the entropy
      3. 4.1.3: Huffman codes
      4. 4.1.4: Bounds on code length
      5. 4.1.5 The Kullback-Leibler divergence (relative entropy)
        1. 4.1.5.1: Example of the usage of the KL divergence
      6. 4.1.6: The Mutual information
      7. 4.1.7: The Data Processing Inequality
      8. 4.1.8: The Entropy rate
      9. 4.1.9 Differential Entropy
      10. 4.1.10: Run-length Coding
      11. 4.1.11: Arithmetic coding
      12. 4.1.12: Lempel-Ziv (LZW) coding
      13. 4.1.13: Vector Quantization
    2. 4.2: Rate Distortion Theory
      1. 4.2.1: Rate-distortion pair
      2. 4.2.2: Rate-distortion region
      3. 4.2.3: Rate distortion function
      4. 4.2.4: Examples
        1. 4.2.4.1: Binary source
        2. 4.2.4.2: Continuous source
  5. Part 5: Machine learning
    1. 5.1: Neural Networks
      1. 5.1.1: Training
        1. 5.1.1.1: Batch
        2. 5.1.1.2: Mini-Batch
        3. 5.1.1.3: Stochastic gradient descent
        4. 5.1.1.4: Back propagation
      2. 5.1.2: Deep Neural Networks
      3. 5.1.3: Convolutional Neural Networks
      4. 5.1.4: Recurrent neural networks
    2. 5.2: Encoder-decoder networks
    3. 5.3: Attention
      1. 5.3.1: Transformer networks
    4. 5.4: Auto-encoders

TTT4135: Multimedia Signal Processing


This is a work in progress. It will be completed while reading for the exam.

Part 1: Statistical characterization of audiovisual information

1.1: Digitization

We want to digitize our signals for multiple reasons. Digital signals are more robust to noise introduced by storage and transmission. Digitization allows for inclusion of error protection and makes encryption possible. It allows us to integrate speech and image so that the data is independent of the network. It makes packet transmission over the internet possible, and introduces the possibility of integration between transmission and switching. We can use digital signal processing on the digitized information, and it conforms with the development of modern digital circuits.

(Figure: the digitization process)

1.1.1: Digital Speech

  • Bandwidth: 50-3400 Hz
  • Sampling: 8 kHz
  • Representation: 8 bit Log PCM (13 bit linear PCM)
  • Dynamic range: 42 dB
  • Characterization:
    • Spectral components
    • Component description (phonemes)

Digital speech is often represented as a spectrum (frequency/amplitude) or a spectrogram (frequency/time/amplitude).

1.1.2: Digital images/video

  • Many proprietary formats: .flv, .avi, .mp4 etc.
  • Rates go from ~20 kbps to 27 Mbps

Images have significant correlation between spectral components. The choice of color space does not really matter, and conversion between different color spaces can be done fairly easily. There is significant correlation between neighbouring pixels. This means that we do not have to treat every pixel by itself, but can instead look at a larger area of pixels. Natural images often exhibit large semi-uniform areas. Images are also non-stationary.

We have different tools for analysis and coding/compression of digital images.

  • Analysis:
    • Fourier transform in two dimensions
    • Subband decomposition
  • Coding/compression:
    • Block transforms (DCT)
    • Filter banks/Subband decomposition
      • Wavelets
      • General filters
    • Prediction/estimation
      • Motion compensation
    • Quantization
    • Entropy coding

1.2: Compression

For a source coder, the output should be efficient (compression), have minimum distortion (fidelity) and be robust (reconstruction). The tradeoff pyramid for coding shows that time delay, compression ratio and complexity all affect the overall received signal quality. There are three fundamental steps in compression: signal decomposition, quantisation, and coding.

1.2.1: Signal decomposition

Removes redundancy. Correlation between consecutive samples can be removed reversibly (without loss of information). This step can also be known as the sampling step, usually involving an A/D conversion.

When dealing with audiovisual content there are two ways we can reduce the size of the files. A lot of the signal contains redundancy, for example big areas of uniformity in an image. This can be removed reversibly, without loss of information. Some of the content also contains irrelevancy, for example frequencies above the human hearing range in audio content. Removing irrelevancy is an irreversible operation.

1.2.2: Signal quantisation

Removes irrelevancy. Non-perceivable information can be removed irreversibly (with loss of information). The quantisation can be described by an encoder $E$ mapping a sample to an index and a decoder $D$ mapping the index back to a reconstruction value:

\begin{equation} Q(s)=D(E(s)) \end{equation}
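As a minimal sketch (not from the course material), a uniform quantizer can be written exactly in this form: an encoder $E$ maps a sample to an integer index (the irreversible step) and a decoder $D$ maps the index back to a reconstruction level. The step size of 0.25 is an arbitrary assumption.

import numpy as np

def E(s, step=0.25):
    # Encoder: map samples to integer indices (this is the irreversible step)
    return np.round(s / step).astype(int)

def D(idx, step=0.25):
    # Decoder: map indices back to reconstruction levels
    return idx * step

s = np.array([0.1, 0.37, -0.81])
print(D(E(s)))  # Q(s) = D(E(s)) -> [0.0, 0.25, -0.75] with step 0.25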

1.2.3: Signal coding

Transmit a perceptually relevant and statistically non redundant signal over the channel to the receiver. Makes the compression efficient.

The source coder output should be:

  • Efficient (compression)
  • Minimal distortion (fidelity)
  • Robust (reconstruction)

1.3: Manipulation

1.4: Communications

1.5: Filtering

Part 2: Standards

2.1: Audio Compression

2.2: Image Compression

Image compression can be done either lossless or lossy. Lossless means that we can compress the images, then decompress them and get back exactly the same image. With lossy compression, the images are altered in a way that makes it impossible to recover the original images. We are mostly interested in lossy compression in this course, so the lossless formats will only be presented briefly.

A picture is usually divided into slices, then macroblocks, then 8x8 pixel blocks.

The spectral redundancy can be removed by doing a color transform, for example YUV or grayscale conversion. The spatial redundancy can be reduced with transform coding/subband decomposition and quantization. The temporal redundancy can be reduced by using motion compensation/estimation, predictive coding and temporal filtering. Entropy coding is a "free" way to compress the content further. This can be achieved by using Huffman coding or arithmetic coding.

2.2.1: Lossless formats

2.2.1.1: Tiff

Stands for "Tagged Image File Format". Can handle color depths from 1-24 bit. Uses Lempel-Ziv-Welch lossless compression. TIFF is a flexible, adaptable file format for handling images and data within a single file by including the header tags (size, definition, image-data arrangement, applied image compression). Can be a container holding lossy and lossless compressed images.

2.2.1.2: BMP

Short for "Bitmap". Developed by windows. Stores two-dimensional digital images both monochrome and color, in various color depths, and optionally with data compression, alpha channels and color profiles.

2.2.1.3: GIF

Stands for "Graphics Interchange Format". Can only contain 256 colors. Commonly used for images on the web and sprites in software programs

2.2.1.4: PNG

Stands for "Portable Network Graphics". Supports 24-bit color. Supports alpha channel, so that an image can have 256 levels of transparency. This allows designers to fade an image to a transparent background rather than a specific color.

2.2.2: Lossy formats

2.2.2.1: JPEG

JPEG is waveform based, and uses the DCT. The JPEG encoding process is as follows: first we do a color transform from RGB to YCbCr to reduce spectral redundancy. We segment the image into 8x8 blocks and take the DCT of each block. JPEG has hard-coded quantization tables that quantize the coefficients of each block. The quantized coefficients are then zigzag scanned, run-length coded and finally entropy coded with Huffman or arithmetic coding.
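To illustrate the block transform step, here is a rough sketch (not the reference JPEG implementation) of the 2-D DCT and quantization of a single 8x8 block; the flat quantization table is a placeholder, whereas the real JPEG tables are frequency dependent.

import numpy as np
from scipy.fftpack import dct

def dct2(block):
    # 2-D DCT-II: apply the 1-D DCT along rows, then along columns
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

block = np.random.randint(0, 256, (8, 8)).astype(float) - 128  # level-shifted 8x8 block
Q = np.full((8, 8), 16.0)              # placeholder quantization table
coeffs = np.round(dct2(block) / Q)     # quantized coefficients; many become zero
print(coeffs)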

2.2.2.2: JPEG2000

JPEG2000 uses wavelets, and is used mainly for digital cinema. JPEG2000 is meant to complement, not replace, JPEG. The most interesting additions are region of interest (ROI) coding, better error resilience and more flexible progressive coding. It also allows lossy and lossless compression in one system. It provides better compression at low bit-rates, handles computer-generated images and graphics better, and performs better under error-prone conditions. It uses the discrete wavelet transform (DWT) instead of the DCT for signal decomposition. This makes the artefacts from JPEG2000 "ringing", caused by the removal of high-frequency components, instead of JPEG's blockiness.

Images may contain 1-16384 components, and sample values can be either signed or unsigned. Bit depths up to 38 bits are supported. The first step is preprocessing of the images. The input images are partitioned into rectangular, non-overlapping tiles of equal size; the tile size can vary from one pixel to the whole image. Each sample is DC level shifted by subtracting $2^{B-1}$, which is half the maximum value, where $B$ is the bit depth. Using the DWT we get a multi-resolution representation of the image. It separates the image into high- and low-frequency bands. Most information is stored in the low-frequency content, so the low band is often filtered several times to get a better representation. The coefficients are then quantized with a uniform quantizer with a central deadzone if we want lossy compression, or not quantized at all if we want lossless compression. We then code the bitstream using EBCOT: each subband is partitioned into a set of equally sized blocks, and the blocks are encoded independently. The final bitstream is composed of a collection of "layers".
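To illustrate the subband decomposition idea, here is a minimal sketch of one level of a 2-D DWT using the simple Haar wavelet (JPEG2000 actually specifies the 9/7 and 5/3 filters); applying it again to the LL band gives the multi-resolution representation described above.

import numpy as np

def haar_dwt2(img):
    # One level of a 2-D Haar DWT; image dimensions are assumed to be even
    a = (img[0::2, :] + img[1::2, :]) / 2  # vertical low-pass
    d = (img[0::2, :] - img[1::2, :]) / 2  # vertical high-pass
    LL = (a[:, 0::2] + a[:, 1::2]) / 2
    LH = (a[:, 0::2] - a[:, 1::2]) / 2
    HL = (d[:, 0::2] + d[:, 1::2]) / 2
    HH = (d[:, 0::2] - d[:, 1::2]) / 2
    return LL, LH, HL, HH

LL, LH, HL, HH = haar_dwt2(np.random.rand(8, 8))
print(LL.shape)  # (4, 4); most of the energy ends up in the LL band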

JPEG2000 is scalable, in the sense that it can achieve coding of more than one quality and/or resolution simultaneously. It generates a single bitstream from which different qualities or resolutions can be extracted.

2.3: Video Compression

In video we have both spatial and temporal redundancy.

2.3.1: MPEG

MPEG is a family of video and audio coding standards. For signal decomposition, all the standards use DCT + quantization. They use motion compensation as well as lossless entropy coding for motion vectors and quantized DCT samples. The encoder can trade off between quality and bitrate. The default quantization matrix can be multiplied by a quantization parameter in order to obtain different quantization steps. Small steps give high quality and high bitrate, while large quantization steps give low quality and low rate. The resulting bitrate cannot be known accurately from the quantization parameter, because the bitrate depends on spatial and temporal frame activity and on the GOP structure. In practice, constant bitrate and constant quality cannot be obtained at the same time.

2.3.1.1: MPEG1

MPEG-1 consists of five parts: systems, video, audio, conformance testing and reference software. The goal was interactive movies on CD (1.4 Mbit/s) with VHS-quality video and CD-quality audio. MP3 is part of the MPEG-1 audio standard. It is also used in cable TV and DAB. The standard only specifies the bitstream syntax and the decoder, so any encoder that produces a conforming bitstream is allowed.

(Figure: MPEG-1)

2.3.1.2: MPEG2

MPEG-2 evolved from the shortcomings of MPEG-1. It consists of nine parts, where parts 1-5 are the MPEG-1 counterparts. MPEG-2 brings better support for audiovisual data transport and broadcast applications. Different modes for motion compensation are used. It adds support for interlaced video, with an extra vertical scan mode for this, and adds 4:2:2 subsampling in addition to 4:2:0. It defines profiles and levels for specific types of applications: profiles specify the quality of the video, while levels specify the resolution. Frame rates range from 30-60 fps and rates from 4-80 Mbps. It has better scalability in terms of spatial (resolution), temporal (fps) and quality (SNR) scalability, and includes advanced audio coding (AAC), digital storage media command and control, and real-time interfaces to transport stream decoders. It is used on DVDs and in digital broadcasting.

2.3.1.3: MPEG4

MPEG-4 absorbs parts from MPEG-1/2 and has 21 parts. It allows for more content types, like graphics, animation, symbolic music and subtitles, and adds more tools for networking and storage. It features advanced video coding (AVC). It is used for HDTV broadcasting, IP-based TV, gaming, mobile video applications, streaming, video conferencing etc. It is suitable for mobile environments with limited link capacity thanks to high compression efficiency and scalability. It is supported in several web applications, but most applications only support the video and audio parts.

2.3.1.4: MPEG-7

This standard is designed for multimedia content description. It is useful in digital library, search and broadcast selection applications. It makes it possible to do text-based search in audiovisual content in a standardized way. It uses XML to store metadata, which can be attached to timecodes in order to tag particular events, or to sync lyrics to a song, for example. It is not widely used.

2.3.1.5: MPEG-21

An open framework for delivery and consumption of digital multimedia content, and for digital rights management. It defines an XML-based "Rights Expression Language". It is built around the concepts of users and digital items, and the content does not need to be MPEG content. It consists of 19 parts. It is not widely used.

2.3.1.6: MPEG-I

A collection of standards to digitally represent immersive media consisting of 26 parts.

2.3.1.7: H.264/AVC

H.264/AVC was intended to achieve a 50% rate reduction at fixed fidelity compared to existing video codecs. New features in AVC are improved networking support and error robustness, improved transforms and entropy coding, and improved prediction techniques. It includes the network abstraction layer (NAL), which is designed to provide network friendliness. A NAL unit can contain control data or encoded slices of video, and is encapsulated in an IP/UDP/RTP datagram and sent over the network. NAL gives higher robustness to packet loss by sending redundant frames at a lower quality, or by using different macroblock interleaving patterns. AVC adds two new frame types, the SP slice and the SI slice. It uses intra prediction that removes the redundancy between neighbouring macroblocks. It includes variable block-size motion compensation, which selects the partition of each 16x16 macroblock into subblocks that minimizes the coded residual and motion vectors: areas with small changes are best predicted with large blocks, and detailed areas with small blocks. It uses seven different block sizes, which can save more than 15% compared to using only a 16x16 block size. It also uses an in-loop deblocking filter which removes blocking artefacts. It uses two different entropy coding methods, CAVLC or CABAC; CABAC outperforms CAVLC by 5-15% in bit rate savings. It introduces different profiles, supporting 8-12 bits/sample and 4:2:0 to 4:4:4 subsampling, and some even support lossless region coding. It is not backwards compatible, and has a high encoder complexity. The transmission rate of AVC is 64 kbps-150 Mbps.

2.3.1.8: H.265/HEVC

High resolution (UHD) video demanded a new codec. For UHD we need better compression efficiency and lower computational complexity. HEVC is designed for parallel processing architectures for low-power, low-complexity processing. It offers high spatial resolution, high frame rates and high dynamic range, and has an expanded colour gamut and an expanded number of views.

The design of HEVC had the following goals:

  • Doubling the coding efficiency
    • 50% better quality at the same bitrate as compared to H.264
    • 50% lower bit rate at the same quality as compared to H.264
  • Parallel processing architectures
    • Low power processing
    • Low complexity

Some specifications

HDTV is 1920x1080p at 50/60 fps. UHD-1 is 3840x2160p at 50/60 fps and 120 Hz. 4K is 4096x2160 at 24/48 fps. UHD-2 (8K) is 7680x4320p at 50/60 fps and 120 Hz. HEVC uses a coding tree structure instead of macroblocks, which is more flexible with respect to block sizes. HEVC supports variable PB sizes from 64x64 down to 4x4 samples. It has 35 modes for intra prediction, including 33 directional modes plus planar (surface fitting) and DC prediction. It has in-loop deblocking filters and sample adaptive offset (SAO) filters. It has high complexity, but superior parallelization tools. HEVC outperforms VP8. It has had some problems with getting a patent pool of companies together to settle royalties.

2.3.2: What encoder do we choose?

Important question in deciding which codec to use can be:

Technical:

  • Compression ratios
  • Quality improvements
  • 2D, 3D, multiview?
  • Complexity (power constraints, delay)

Business:

  • Application or service driven
  • Specialised codec or all-rounder?
  • Standardised, proprietary or open source?

We once again return to the tradeoff pyramid from earlier: high received signal quality cannot be obtained while time delay, complexity and compression ratio are all decreased.

Comparison between video standards

| Feature | MPEG-2 | H.264/AVC | H.265/HEVC |
| --- | --- | --- | --- |
| Macroblock size | 16x16 (frame mode), 16x8 (field mode) | 16x16 | CTUs of NxN samples (N=16, 32, 64) |
| Block size | 8x8 | 16x16, 8x16, 16x8, 8x8, 4x8, 8x4, 4x4 | Tree structured, from 64x64 to 4x4 samples |
| Intra prediction | No | Spatial domain, 8 modes | 33 directional modes, planar (surface fitting), DC prediction (flat) |
| Transform | 8x8 DCT | 8x8, 4x4 integer DCT; 4x4, 2x2 Hadamard | 32x32 to 4x4 integer DCT, tree structured, maximum of 3 levels |
| Quantization | Scalar | Scalar | Scalar, rate-distortion optimized quantization |
| Entropy coding | VLC | VLC, CAVLC, CABAC | CABAC |
| Pel accuracy | 1/2 pel | 1/4 pel | 1/4 pel |
| Reference pictures | One picture | Multiple pictures | Multiple pictures |
| Weighted prediction | No | Yes | Yes |
| Deblocking filter | No | In-loop deblocking | In-loop deblocking, SAO |
| Picture types | I, P, B | I, P, B, SI, SP | IDR, GPB (generalized P and B), referenced B and non-referenced B pictures |
| Profiles | 5 profiles | 7 profiles | 3 profiles |
| Transmission | 2-15 Mbps | 64 kbps-150 Mbps | Unknown |
| Encoder complexity | Low | Medium | High, superior parallelization tools |
| Compatible with previous standards | Yes | No | With H.264? |

2.3.3: Video encoding

2.3.3.1: Frame types

For video compression we use I, P and B frames to compress the raw video.

Intra-frames (I-frames)
are compressed frames from the original video frames and the least compressible. Holds information of a complete image.
Predicted-frames (P-frames)
are predicted frames between two I/P-frames and more compressible than I‑frames. Only holds the changes in the image from the previous frame.
Bidirectionally predicted frames (B-frames)
can be seen as an interpolation between I and P-frames and get the highest amount of data compression. Holds information about both the preceding and following frames.

A Group of Pictures (GOP) is the sequence of frames between two I-frames. Intra-coding is coding within a frame; an I-frame is an intra-coded frame. Inter-coding is coding between multiple frames, where we can use a previous I-frame and estimate what has moved in the next frame. Consecutive frames in a video are often very similar, and it would be wasteful to intra-code every frame.

2.3.3.2: Motion estimation

Ideally we would want motion information for each pixel in the image, but with a high-resolution image this is far too expensive. We could instead get the motion information for each homogeneous region or object in the image. In practice, we often do motion estimation for each 16x16 macroblock. We investigate all possible positions within a search window, compare the original macroblock with the candidate at each position, and keep the one with the lowest mean square error. The motion vector gives us the corresponding translation.
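A minimal sketch of this exhaustive block-matching search (assuming an MSE criterion and a +/-8 pixel search window; the function and variable names are illustrative only):

import numpy as np

def block_match(cur_block, ref, top, left, search=8):
    # Return the motion vector (dy, dx) inside the search window that minimizes
    # the MSE between cur_block and the displaced block in the reference frame.
    N = cur_block.shape[0]
    best_mse, best_mv = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + N > ref.shape[0] or x + N > ref.shape[1]:
                continue  # candidate block falls outside the reference frame
            mse = np.mean((cur_block - ref[y:y+N, x:x+N])**2)
            if mse < best_mse:
                best_mse, best_mv = mse, (dy, dx)
    return best_mv, best_mse

ref = np.random.rand(64, 64)
cur = np.roll(ref, (2, 3), axis=(0, 1))             # fake a global motion
print(block_match(cur[16:32, 16:32], ref, 16, 16))  # ((-2, -3), 0.0): best match found 2 up, 3 left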

With a model for affine motion we have translation, rotation and scaling parameters, 6 in total for 2D. This is a bit complex, so in practice we often use motion vectors, which have two parameters. Backward prediction predicts where the pixels in the current frame were in a past frame. Forward prediction predicts where the pixels in the current frame will go in a future frame. I-frames use no temporal coding, P-frames use forward prediction, and B-frames use both forward and backward prediction.

2.3.3.3: The hybrid video encoder

(Figure: the hybrid video encoder)

In the figure above we see the hybrid video encoder. For encoding I-frames, we do a DCT + quantization. The compressed frame is then entropy coded and is ready. The frame is also fed through inverse quantization and an inverse DCT and put in a frame buffer for predicting the P-frames. When encoding P-frames, we get a new frame from the raw video and compute the residual between this frame and the previous reconstructed frame. We then do a DCT + quantization on the residual followed by entropy coding, and we have one of the two parts of the P-frame. We also find the motion vectors needed to transform the previous frame into the current frame. These are calculated and then entropy coded.

2.4: Quality measures

2.4.1: Objective measures

Image quality can be assessed objectively by the SNR or the PSNR:

$\sigma_e^2 = \frac{1}{MN}\sum_{m=1}^M\sum_{n=1}^N|u(m,n) - \hat{u}(m,n)|^2$

$\mathrm{PSNR}_{dB} = 10\log_{10}\frac{255^2}{\sigma_e^2}$
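A minimal PSNR computation for 8-bit images, directly following the two formulas above (the function name is just illustrative):

import numpy as np

def psnr(u, u_hat, peak=255.0):
    # PSNR in dB between an original image u and a reconstruction u_hat
    sigma_e2 = np.mean((u.astype(float) - u_hat.astype(float))**2)
    return 10 * np.log10(peak**2 / sigma_e2)

u = np.random.randint(0, 256, (64, 64))
u_hat = np.clip(u + np.random.randint(-3, 4, u.shape), 0, 255)
print(psnr(u, u_hat))  # small errors give a high PSNR, roughly 40 dB here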

Objective measures have some weaknesses. For example, if we have one corrupted line, the PSNR will still be high, while most people would not see this as a good image. A tilt of the image will result in a low PSNR, while it would not be very noticeable to people.

2.4.2: Subjective measures

Determined based on the judgement of a large number of viewers and after elaborate tests comparing the reference and modified content. Results are summarized using mean opinion scores (MOS).

One weakness of subjective measurements is that the viewing conditions may not be optimal, so that the focus may end up on external factors rather than on the content to be tested.

Part 3: QoE

3.1: QoE

Quality is defined by ISO 9000 as "the degree to which a set of inherent characteristics fulfils requirements". ITU-T defines it as "the overall acceptability of an application or service, as perceived subjectively by the end-user". Qualinet defines QoE as "... the degree of delight or annoyance of the user of an application or service. It results from the fulfillment of his or her expectations with respect to the utility and/or enjoyment of the application or service in the light of the user's personality and current state." Phew..

The difference between QoE and QoS is shown in the table below.

| QoS | QoE |
| --- | --- |
| Network-centric | User-centric |
| Delay, packet loss, jitter | Content dependent |
| Transmission quality | Context dependent |
| Content agnostic | Perception |

QoE sets the user in focus, and as we know, users are not a homogeneous group.

3.2: Universal Multimedia Access (UMA)

UMA addresses the delivery of multimedia resources under different and varying network conditions, diverse terminal equipment capabilities, specific user or creator preferences and needs, and usage environment conditions. UMA aims to guarantee unrestricted access to multimedia content from any device, through any network, independently of the original content format, efficiently and with guarantees, while satisfying user preferences and usage environment conditions. The fulfillment of UMA requires that content is remotely accessible and searchable. Useful descriptions of the content and context (terminal capabilities) are needed, as well as content delivery systems able to use this information to provide the intended value to the users independently of location and type of device. One crucial part is the content and context descriptors, which decide whether the content needs to be adapted before being delivered to the end user. MPEG-7 and MPEG-21 are well suited for implementation of UMA systems.

3.3: Immersion

The sense of being immersed can be described as "being there" when consuming content. The duality in creating an experience is "the sense of realness" versus "the sense of being there".

Part 4: Information theory

It is of both practical and theoretical interest to understand how much we can compress any given data source.

In the lossless case this should be some number that tells us that this data source has a representation that is a factor $N$ smaller, or just $P$ % of the original size. This representation may be purely theoretical, involving computations too complex to perform in practice.

It would still be valuable to know the absolute limit on how much a data source can be compressed.

For the lossy case we would like to know the compression limit given an allowable amount of distortion. Or vice versa -- how much distortion must we allow if the compression needs to be smaller than some factor $N$. The field of information theory enables us to answer all of the above questions. From Wikipedia:

Information theory is the scientific study of the quantification, storage, and communication of information.

  • Quantification: We need descriptions of information sources. For this we use statistics, and specify information sources in terms of distributions over alphabets and symbol sequences.
  • Storage: How much space is needed to store $N$ bytes of information from a given source. This is the main topic of our lectures.
  • Communication: Given a channel with additive noise, what is the capacity, in symbols per second, that can be transmitted with an arbitrarily low probability of error?

The concept of information lies at the heart of Information Theory. We need to be able to specify what we mean by information, and how to measure and quantify it.

When we watch the news, read a book or listen to the radio, we can imagine that we receive information if we learn something we didn't already know. Depending on how long it took, or how many pages were needed, to convey that information to us, we can say something about how efficiently that information was conveyed.

As a thought experiment we can imagine trying to shorten down a news clip to its bare essentials - just the actual information. How few sentences can one use? When do we hit a limit where using fewer than say $N$ words, will lead to loss or inaccuracies in the information?

To be able to formalize this process we need to get down to the most basic of information sources - the flip of a coin.

4.1: The Entropy and Data Compression

Information and entropy

Imagine that instead of watching your favorite news show, you are watching a person flip a coin. Although mundane, the result of such a flip is still information. Watching the coin flip turn up as heads or tails is something you didn't know beforehand.

We assume that a sequence of coin tosses is identically and independently distributed (iid), with probability $p$ for heads and $1-p$ for tails. The distribution $\{p, 1-p\}$ describes the information source (the coin tosses) completely.

Pondering

  1. Assume the probability $p$ of heads is 1.0. How much information do you think you get by observing heads after a toss?
    • A good guess would be zero. Since we know that the toss is going to be heads the information we get when it happens is nothing.
  2. How much information would you receive if a toss with the same coin yielded tails?
    • If something is impossible, the corresponding information should be infinite. Informally, if we have coins with probabilities 1/10, 1/1000, 1/1000000 and so on for getting heads, we would think the smaller odds yield more information. As $p\rightarrow 0$ the information should grow to infinity.
  3. If two coins have $p$ equal to 0.1 and 0.5 respectively, which coin would yield the most surprising tosses on average?
    • $p=0.1$ yields more information, but happens more rarely. $1-p=0.9$ is less informative and happens more often. Although we cannot say for sure just yet, intuition tells us that coin tosses with $p=0.5$ are more informative on average.
  4. If you toss two coins independently, what should the average information be when compared to a single toss?
    • Two independent coin tosses should yield double the information of a single coin toss. With fair coins, seeing e.g. {heads, heads} has a probability of 1/4, while just seeing heads for a single, fair coin toss has probability 1/2.

We want a measure of information that is expressed as a function $f(p)$ for a coin having probability $p$ for heads. We also want the following to hold:

  • $f(1) = 0$. Things that are sure yields no information
  • $f(0) = \infty$. Impossible outcomes have infinite information
  • $f(p_1,p_2) = f(p_1)+f(p_2)$ if tosses are independent

The function that makes all three constraints hold is $f(p) = -\log p$. Commonly we use the 2-logarithm,

\begin{equation} f(p) = -\log_2 p = -\frac{\log p}{\log 2} \end{equation}

The reason for using the 2-logarithm is that we want to measure information in bits.

Having defined a measure of information we can introduce the most important quantity in information theory -- the Entropy.

This is defined as the expected, or average, information for an information source, and is defined as

\begin{equation} H(p) = -p \log_2 p -(1-p)\log_2 (1-p) \end{equation}

for the coin toss, or more generally, the binary distribution.

In general, for a finite, discrete random variable $X$ having a set of outcomes $\{x_i\}$ with probabilities $p_i = P_X(X = x_i)$, we have

\begin{equation} H(X) = - \sum_i p_i \log_2 p_i = E\left[ -\log_2 p_i \right] \end{equation}

where $E[\cdot]$ is the expectation operator. The above equation gives the entropy of the signal, in other words the information content of the signal. With a code we can achieve an average code length greater than or equal to the entropy, but never less. Code efficiency can be measured with $\eta = \frac{H(S)}{L}$. We can increase the efficiency of a code by using source expansion, combining two or more consecutive symbols of the source into one supersymbol, at the cost of added complexity or time delay.
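As a minimal sketch, the entropy of a discrete source and the efficiency of a given code can be computed like this (the distribution and the code below are assumed examples, not from the text):

import numpy as np

def entropy(p):
    # Entropy in bits of a discrete distribution p (zero-probability symbols are ignored)
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p = [0.5, 0.25, 0.125, 0.125]
L = 1.75  # average length of the code {0, 10, 110, 111} for this source
print(entropy(p), entropy(p) / L)  # entropy 1.75 bits, efficiency eta = 1.0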

Data compression

Before we look at the properties and extensions of the entropy, we state what is to us the most important result regarding the entropy of an information source -- Data Compression.

We need some definitions before we make any statements about data compression. Let $X^n$ be a random variable that produces bit sequences $x^n$ of length $n$, where each bit in the sequence is iid. with

\begin{align} P(X=0) &= p \\ P(X=1) &= 1-p \end{align}

Assume that $n=2$. Then all the sequences that can be generated are $x^n\in\{00,01,10,11\}$. We now define a code for $X^n$ by assigning every $x^n$ to a codeword $c$. Example

$$ \begin{align} 00 & \rightarrow 0 \\ 01 & \rightarrow 10 \\ 10 & \rightarrow 110 \\ 11 & \rightarrow 111 \\ \end{align} $$

This is a prefix-code, which means that no codeword is the prefix of another codeword, hence, a coded sequence can be uniquely decoded.

Example. Let us code the bit sequence 101101001001 using our new code: \begin{equation} 101101001001 \rightarrow 110-111-10-0-110-10 \end{equation}

We seem to have coded a 12 bit sequence into a 14 bit sequence, which doesn't make much sense from a compression perspective. However, another sequence like 000010001000 is coded as

$$ \begin{equation} 000010001000 \rightarrow 0-0-110-0-110-0 \end{equation} $$ which turns a 12 bit sequence into a 10 bit sequence.

Let $l:X^n\rightarrow \mathbb{N}$ give the length of the codeword assigned to a given sequence $x^n$. For our example

$$ \begin{align} l(00) = 1 \\ l(01) = 2 \\ l(10) = 3 \\ l(11) = 3 \end{align} $$

The expected length of a code given a distribution over $X^n$ is

\begin{equation} L = E_{X^n}[l(x^n)] = \sum_{x^n} P(X^n = x^n)l(x^n) \end{equation}

4.1.1: Data Compression Theorem

Let $X$ be a random variable over some alphabet $\mathcal A$. Let $\epsilon > 0$ be some arbitrarily small number. There then exists some sufficiently large $n$ and a code that maps $x^n$ into binary strings such that the mapping is one-to-one and

$$ \begin{align} E \left[ \frac{1}{n} l(X^n) \right] \leq H(X) + \epsilon \end{align} $$

Let's use a concrete example to show the power of this theorem. Let $X$ be a binary source (zeros and ones) with

$$ \begin{align} P(X=0) = 3/4 \\ P(X=1) = 1/4 \end{align} $$

We compute the entropy of $X$:

import numpy as np

p = 3/4
H = -p*np.log2(p) - (1-p)*np.log2(1-p)
L = (9/16)*1 + (3/16)*2 + (3/16)*3 + (1/16)*3  # expected length of the example prefix code on pairs
print("The entropy is", H, "vs the code rate", L/2)

The theorem now tells us that we can find an $n$ and a code so that the average length of the coded sequences $x^n$ is arbitrarily close to the entropy.

The theorem doesn't tell us how to find these codes, or even how large $n$ should be. In this sense it is a non-constructive theorem. In practice however, getting close enough to $H(X)$ for all practical purposes is well within our reach.

4.1.1.1: Proof of the Data Compression Theorem (DCT)

We will not give a full, formal proof of the theorem as we would need to introduce a lot of mathematical concepts that you would otherwise have no use for. We will instead give a rough sketch that hopefully gives some insight into the material. For those interested, you should consult chapter three of Elements of Information Theory by Cover and Thomas for a proper proof (online edition available through the NTNU Library).

The main observation behind the DCT is that the set of all sequences of $n$ symbols can be divided into a $typical$ and a $non$-$typical$ set. The typical sequences are those that have roughly (assuming a binary source $X$ with $P(X=0)=p$) $p\cdot n$ zeros and $(1-p)\cdot n$ ones. All other sequences are non-typical.

Asympotic Equipartition Property (AEP)

Formally we say that a sequence is typical if \begin{equation} 2^{-n(H(X)+\epsilon)} \leq P(x_1,x_2,\ldots,x_n) \leq 2^{-n(H(X)-\epsilon)} \end{equation} and we denote the set of all typical sequences as $A_\epsilon^{(n)}$

Let us look at the lower bound $2^{-n(H(X)+\epsilon)}$ in a little more detail. We can write

$$ \begin{align} & 2^{-n(H(X)+\epsilon)} \\ = & 2^{-n(-p\log_2 p -(1-p)\log_2(1-p) + \epsilon)} \\ = & 2^{ np\log_2 p + n(1-p)\log_2(1-p) - n\epsilon} \\ = & 2^{ np\log_2 p} 2^{n(1-p)\log_2(1-p)} 2^{- n\epsilon} \\ = & p^{np} (1-p)^{n(1-p)} 2^{-n\epsilon} \end{align} $$

In the same way we can write

$$ \begin{equation} 2^{-n(H(X)-\epsilon)} = p^{np} (1-p)^{n(1-p)} 2^{n\epsilon} \end{equation} $$

so

$$ \begin{equation} p^{np} (1-p)^{n(1-p)} 2^{-n\epsilon} \leq P(x_1,x_2,\ldots,x_n) \leq p^{np} (1-p)^{n(1-p)} 2^{n\epsilon} \end{equation} $$

We see that the sequences defined by the bound are exactly the sequences with roughly $np$ zeros and $n(1-p)$ ones.

You can now show that the size of the typical set is

$$ \begin{equation} |A_\epsilon^{(n)}| \leq 2^{n(H(X)+\epsilon)} \end{equation} $$

We know that we need at least $\lfloor \log_2 M \rfloor + 1$ bits to represent all numbers from 1 through $M$. Taking $\log_2$ on both sides of the inequality above then yields

$$ \begin{align} \log_2 |A_\epsilon^{(n)}| & \leq n(H(X)+\epsilon) \\ \frac{1}{n} \log_2 |A_\epsilon^{(n)}| &\leq H(X) + \epsilon \end{align} $$

This means we only need $n(H(X) + \epsilon)$ bits to represent typical length $n$ sequences.

What about the non-typical sequences then? Again, "it can be shown" that

$$ \begin{equation} P(A_\epsilon^{(n)}) \geq 1 - \epsilon \end{equation} $$

which means that non-typical sequences are extremely unlikely and can be represented with $k \geq n$ bits without it affecting our performance.

4.1.2: Further properties of the entropy

The Joint Entropy

Given two random variables $X$ and $Y$ we can define their joint entropy as $$ \begin{align} H(X,Y) & = -\sum_m \sum_n P(X=x_m,Y=y_n) \log_2 P(X=x_m, Y=y_n) \\ & = -\sum_m \sum_n P(x_m,y_n) \log_2 P(x_m,y_n) \\ & = E_{X,Y}[-\log_2 P(x,y)] \end{align} $$

The Conditional Entropy

Given two random variables $X$ and $Y$ with joint distribution $P(X,Y)$ we can define the conditional entropy $H(Y|X)$ as $$ \begin{align} H(Y|X) & = -\sum_m \sum_n P(X=x_m,Y=y_n) \log_2 P(Y=y_n|X=x_m) \\ & = -\sum_m \sum_n P(x_m,y_n) \log_2 P(y_n|x_m) \\ & = E_{X,Y}[-\log_2 P(y|x)] \end{align} $$

The Chain rule for entropy

From probability theory we know that the joint distribution $P(X,Y)$ can be written \begin{equation} P(X,Y) = P(Y|X)P(X) \end{equation}

We can now easily show that

$$ \begin{align} H(X,Y) & = E_{X,Y}[-\log_2 P(x,y)] \\ & = E_{X,Y}[-\log_2 P(y|x)P(x)] \\ & = E_{X,Y}[-\log_2 P(y|x) - \log_2 P(x)] \\ & = E_{X,Y}[-\log_2 P(y|x)] + E_{X,Y}[-\log_2 P(x)] \\ & = H(Y|X) + H(X) \\ & = H(X|Y) + H(Y) \end{align} $$

4.1.3: Huffman codes

We will now present one method to find a set of codewords for a set of sequences $x^n$ and their probabilities $P(X^n=x^n)$. The method is referred to as Huffman coding and has the nice property that it is optimal in the sense of yielding the lowest average length of any prefix code for the given $n$ and $P$. It should be noted, though, that increasing $n$ can allow us to find a Huffman code with an even lower average length per source symbol.

The algorithm goes as follows:

  1. Find the two symbols with lowest probability (here $X=$4 and 5)

  2. Assign a 0 and a 1 to these two symbols

  3. Sum their probabilities and insert a new symbol (4,5) into the sorted table

  4. Repeat the procedure with symbols $X=$2 and 3, getting a new symbol (2,3)

  5. At the third step we assign 0 to (4,5) and 1 to $X=1$, and get a new symbol ((4,5),1)
  6. At the final step we assign 0 to ((4,5),1) and 1 to (2,3)
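A minimal sketch of the procedure in code, assuming the five-symbol probabilities {0.25, 0.25, 0.2, 0.15, 0.15} (chosen to be consistent with the average length 2.3 and entropy of roughly 2.29 quoted below; the original probability table is not reproduced here):

import heapq

def huffman(probs):
    # Each heap item: (probability, tie-breaker, list of (symbol, code-so-far))
    heap = [(p, i, [(i + 1, "")]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)  # the two least probable groups
        p1, _, c1 = heapq.heappop(heap)
        merged = [(s, "0" + c) for s, c in c0] + [(s, "1" + c) for s, c in c1]
        heapq.heappush(heap, (p0 + p1, len(heap), merged))
    return dict(heap[0][2])

probs = [0.25, 0.25, 0.2, 0.15, 0.15]
codes = huffman(probs)  # symbols 1..5 -> codewords
print(codes, sum(p * len(codes[i + 1]) for i, p in enumerate(probs)))  # average length 2.3

Ties between equal probabilities may be merged in a different order than in the example above, but the resulting average code length is the same.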

Pondering

  • Is this a prefix code?
    • Yes, no codeword is a prefix of another, so coded sequences can be uniquely decoded
  • What is the average length of a coded symbol for this code?
    • 2.3
  • What is the entropy of the source $X$?
    • 2.29
  • Create a Huffman code for the alphabet $\{1,2,3,4\}$ with probabilities $P=\{1/2, 1/4, 1/8, 1/8\}$. Compute entropy and average code length.

4.1.4: Bounds on code length

It can be shown (see chapter five in Elements of Information Theory by Cover and Thomas if you want proper proofs) that the optimal code lengths for an information source $X$ are

\begin{equation} l_i^{\star} = -\log_2 P(x_i) \end{equation}

that is, if we could use code words of non-integer length. Instead we have to use code words with length

\begin{equation} l_i = \lceil -\log_2 P(x_i) \rceil \end{equation}

(Here $\lceil x \rceil$ rounds x up to the closest integer, eg. $\lceil 2.1 \rceil = 3$).

We can show the following bound on code lengths, $$ \begin{align} & -\log_2 P(x_i) & \leq & l_i & \leq & -\log_2 P(x_i) + 1 \\ \Rightarrow & -P(x_i)\log_2 P(x_i) & \leq & P(x_i)l_i & \leq & -P(x_i)\log_2 P(x_i) + P(x_i) \\ \Rightarrow & \sum_i -P(x_i)\log_2 P(x_i) & \leq & \sum_i P(x_i)l_i & \leq & \sum_i \left( -P(x_i)\log_2 P(x_i) + P(x_i) \right) \\ \Rightarrow & H(X) & \leq & L & \leq & H(X) + 1 \end{align} $$

4.1.5 The Kullback-Leibler divergence (relative entropy)

We will now introduce a very important quantity, both in the field of Information Theory and Statistics. The Kullback-Leibler (or KL) divergence is a measure of how much a distribution differs from a reference distribution, and is defined as

$$ \begin{align} D(P\|Q) & = \sum_{x\in \cal X} P(x)\log_2 \frac{P(x)}{Q(x)} \\ & = E_P \left[\log_2 \frac{P(x)}{Q(x)}\right] \end{align} $$

Some important properties of the KL divergence are

  1. We always have that $D(P\|Q) \geq 0$
  2. $D(P\|Q) = 0$ iff $P=Q$.
  3. It is asymmetric, that is, $D(P\|Q) \neq D(Q\|P)$ in general
  4. If $H(P,Q)$ is the cross entropy of $P$ and $Q$, and $H(P)$ is the entropy of $P$ (which is the same as the cross-entropy of P with itself) then $D(P\|Q)=H(P, Q)-H(P)$.

We will quickly prove the first two properties. The third property can easily be proven by finding a suitable example. Before proving (1,2) we need the following inequality

\begin{equation} \log x \leq x - 1~\forall x > 0 \end{equation}

Proof

A quick plot is given below instead of a proper proof.

(Figure: plot of $\log x$ and $x-1$)

We can now prove properties 1 and 2 (I will use the natural logarithm in the proof for simplicity. This does not affect the result):

$$ \begin{align} D_e(P\|Q) & = \sum_{x\in\cal X} P(x)\log \frac{P(x)}{Q(x)} \\ & = \sum_{x\in\cal X} -P(x)\log \frac{Q(x)}{P(x)} \\ & \geq \sum_{x\in\cal X} -P(x)\left( \frac{Q(x)}{P(x)}-1\right)\\ & = \sum_{x\in\cal X} \left( P(x)-Q(x)\right)\\ & = 0 \end{align} $$

The if part of property 2 is trivially true since \begin{equation} \log\frac{P(x)}{Q(x)} = \log 1 = 0 \end{equation} if $P(x) = Q(x)$ for all $x$. The only if part can be seen from the inequality, which is only tight ($\log x = x - 1$) for $x=1$.

4.1.5.1: Example of the usage of the KL divergence

Assume that we have an information source $X$, and that instead of knowing the exact symbol probabilities $P(x_i)$, we only have an estimate $Q(x_i)$. Based on this estimate we create a code

$$ \begin{equation} l_i = \lceil -\log_2 Q(x_i) \rceil \end{equation} $$

for instance by using Huffman coding. We can now give an estimate of how far we will be from the true optimal coding rate,

$$ \begin{align} & -\log_2 Q(x_i) & \leq & l_i & \leq & -\log_2 Q(x_i) + 1 \\ \Rightarrow & \sum_i -P(x_i)\log_2 Q(x_i)& \leq& L & \leq & \sum_i -P(x_i)\log_2 Q(x_i) + 1 \\ \Rightarrow & \sum_i -P(x_i)\log_2 \left( \frac{P(x_i)}{P(x_i)}Q(x_i) \right) & \leq & L & \leq & \sum_i -P(x_i)\log_2 \left( \frac{P(x_i)}{P(x_i)}Q(x_i) \right) + 1 \\ \Rightarrow & H(X) + D(P\|Q) & \leq & L & \leq & H(X) + D(P\|Q) + 1 \end{align} $$
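A small numerical illustration of this bound (the distributions $P$ and $Q$ are assumed examples):

import numpy as np

P = np.array([0.5, 0.25, 0.125, 0.125])  # true source distribution
Q = np.array([0.25, 0.25, 0.25, 0.25])   # mismatched estimate used to design the code
l = np.ceil(-np.log2(Q))                 # code lengths l_i = ceil(-log2 Q(x_i))
H = -np.sum(P * np.log2(P))              # true entropy, 1.75 bits
D = np.sum(P * np.log2(P / Q))           # KL divergence D(P||Q), 0.25 bits
L = np.sum(P * l)                        # actual average length, 2 bits
print(H + D, "<=", L, "<=", H + D + 1)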

4.1.6: The Mutual information

The last information theoretic quantity we will discuss is the mutual information. It is a quantity used with joint distributions $P(X,Y)$ and is defined as

$$ \begin{align} I(X;Y) & = \sum_{i,j} P(x_i,y_j) \log_2 \frac{P(x_i,y_j)}{P(x_i)P(y_j)} \\ & = D(P(X,Y)\|P(X)P(Y)) \\ & = E\left[\log_2 \frac{P(x_i,y_j)}{P(x_i)P(y_j)}\right] \end{align} $$

Let us list some properties of the mutual information before further discussion

$$ \begin{align} I(X;Y) & = H(X) - H(X|Y) \\ I(X;Y) & = H(Y) - H(Y|X) \\ I(X;Y) & = H(X) + H(Y) - H(X,Y) \\ I(X;Y) & = I(Y;X) \\ I(X;X) & = H(X) \\ \end{align} $$
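A small numerical check of the relation $I(X;Y) = H(X) + H(Y) - H(X,Y)$, using an assumed toy joint distribution:

import numpy as np

Pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])               # joint distribution, rows = x, columns = y
Px, Py = Pxy.sum(axis=1), Pxy.sum(axis=0)  # marginal distributions

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

I = H(Px) + H(Py) - H(Pxy.ravel())  # mutual information in bits
print(I)                            # about 0.278 bits for this joint distribution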

4.1.7: The Data Processing Inequality

We now give an application of the mutual information. The Data Processing Inequality says that no amount of data processing can improve the inference that can be made from the data.

Before we state the inequality, we need to define a Markov chain. Let $X_1$, $X_2$, ... be a series of random variables that are observed in that order. Let

\begin{equation} P(x_k|x_{k-1},x_{k-2}\ldots) \end{equation}

be the probability of $X_k = x_k$ given all the previous observations. If we can write

\begin{equation} P(x_k|x_{k-1},x_{k-2}\ldots) = P(x_k|x_{k-1}) \end{equation}

for all $k$, then $X_1$, $X_2$, ... is a Markov chain.

Now consider the Markov chain $X \rightarrow Y \rightarrow Z$, where $X$ is the original or "raw" data that is unobserved, $Y$ is the observed version of $X$ and $Z$ is some processed version of $Y$, $Z = g(Y)$.

Examples:

  • $X$ is a communication signal (zeros and ones) and $Y$ is $X$ plus some additive noise
  • $X$ is a label from the set {Cat, Dog, Horse, Capybara}, and $Y$ is a blurry image of the animal

The Data Processing Inequality now states: If we have a Markov chain $X\rightarrow Y\rightarrow Z$, then $I(X;Y) \geq I(X;Z)$.

Again - this means that no amount of processing of the data $Y$ can increase the information gained about $X$.

4.1.8: The Entropy rate

The concept of entropy is not limited to iid. random variables, but can also be defined for stochastic processes. Let $X_1,X_2,X_3,\ldots$ be a stochastic process. Since such a process can be arbitrarily long, we define the entropy rate as

\begin{equation} H({\cal X}) = \lim_{n\rightarrow\infty} \frac{1}{n}H(X_1,X_2,\ldots,X_n) \end{equation}

if the limit exists. Examples:

For iid. random variables $X_i$ we have

$$ \begin{align} H({\cal X}) & = \lim_{n\rightarrow\infty} \frac{1}{n} H(X_1,X_2,\ldots,X_n) \\ &= \lim_{n\rightarrow\infty} \frac{1}{n} \sum_{i=1}^n H(X_i) \\ & = H(X_1) \end{align} $$

We are now ready to consider coding of arbitrarily long sequences.

4.1.9 Differential Entropy

Let $X$ be a continuous random variable (r.v.) with probability density function (pdf) $f_X(x)$. Then the differential entropy is defined as: \begin{equation} h(X) = -\int_\Omega f_X(x)\log f_X(x) dx \end{equation}

The definition of the differential entropy can be motivated from the discrete entropy as follows. (Figure: a continuous distribution discretized with bin width $\Delta$)

Let

$$ \begin{equation} p_i = \int_{i\Delta}^{(i+1)\Delta} f_X(x) dx = f_X(x_i)\Delta \end{equation} $$ define a discrete rv. $X^\Delta$, and write the discrete entropy as

$$ \begin{align} H(X^\Delta) & = -\sum_i p_i \log p_i \\ & = -\sum_i f_X(x_i)\Delta\log (f_X(x_i)\Delta) \\ & = -\sum_i \Delta f_X(x_i)\log f_X(x_i) -\sum_i f_X(x_i)\Delta\log\Delta \\ & = -\sum_i \Delta f_X(x_i)\log f_X(x_i) - \log\Delta \\ & \approx -\int_\Omega f_X(x)\log f_X(x) dx -\log \Delta \end{align} $$

We see that as $\Delta\rightarrow 0$, $-\log\Delta\rightarrow\infty$. This makes sense in a way, as a specific outcome $X=x$ of a continuous rv. has zero probability, and hence corresponds to infinite information.

This is the reason we refer to $h(X)$ as the differential entropy - it is only meaningful as a difference in information between continuous rvs.

We can define the joint and conditional entropies in the same way as for discrete rvs.

$$ \begin{align} h(X,Y) & = -\int_{\Omega_X\times\Omega_Y} f_{X,Y}(x,y)\log f_{X,Y}(x,y) dxdy \\ & = E_{X,Y}\left[ -\log f_{X,Y}(X,Y) \right] \\ h(X|Y) & = -\int_{\Omega_X\times\Omega_Y} f_{X,Y}(x,y)\log f_{X|Y}(x|y) dxdy \\ & = E_{X|Y}\left[ -\log f_{X|Y}(X|Y) \right] \\ \end{align} $$

We can also define the continuous versions of the Kullback-Leibler divergence and the Mutual Information:

$$ \begin{align} D(f_X\|f_Y) &= E_{X}\left[ \log\frac{f_X(X)}{f_Y(X)} \right] \\ I(X;Y) &= E_{X,Y}\left[ \log\frac{f_{X,Y}(X,Y)}{f_X(X)f_Y(Y)} \right] \end{align} $$

Because of the linearity of the integral, all the relationships between entropy, joint entropy, conditional entropy and mutual information hold as well.

Example 1:

The differential entropy of a zero-mean normal distribution $$ \begin{equation} f_X(x) = (2\pi\sigma^2)^{-\frac{1}{2}} e^{-\frac{1}{2}\frac{x^2}{\sigma^2}} \end{equation} $$ Using the definition of the differential entropy we have $$ \begin{align} h(X) &= -\int_{-\infty}^{\infty} f_X(x)\log f_X(x) dx \\ &= -\int_{-\infty}^{\infty} f_X(x) \log \left( (2\pi\sigma^2)^{-\frac{1}{2}} e^{-\frac{1}{2}\frac{x^2}{\sigma^2}} \right)dx \\ &= -\int_{-\infty}^{\infty} f_X(x) \left( -\frac{1}{2} \log (2\pi\sigma^2) -\frac{1}{2}\frac{x^2}{\sigma^2} \right)\\ &= \frac{1}{2} \log (2\pi\sigma^2) -\int_{-\infty}^{\infty} f_X(x) \left( -\frac{1}{2}\frac{x^2}{\sigma^2} \right) dx \\ & = \frac{1}{2} \log (2\pi\sigma^2) + \frac{1}{2} \\ & = \frac{1}{2} \log (2\pi\sigma^2) + \frac{1}{2}\log e \\ & = \frac{1}{2} \log (2e\pi\sigma^2) \end{align} $$
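As a sanity check of this result (a sketch using natural logarithms, i.e. entropy in nats, and an arbitrarily chosen $\sigma$), we can compare the closed form $\frac{1}{2}\log(2e\pi\sigma^2)$ with a numerical approximation of $-\int f_X(x)\log f_X(x) dx$:

import numpy as np

sigma = 2.0
x = np.linspace(-10*sigma, 10*sigma, 200001)
f = np.exp(-x**2 / (2*sigma**2)) / np.sqrt(2*np.pi*sigma**2)  # zero-mean Gaussian pdf
h_numeric = -np.sum(f * np.log(f)) * (x[1] - x[0])            # Riemann-sum approximation
h_closed = 0.5 * np.log(2*np.pi*np.e*sigma**2)
print(h_numeric, h_closed)                                    # agree to several decimals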

Example 2:

We can discuss this independently of the underlying distribution. Let $X$ be an rv. with mean $E_X[X] = \mu$ and pdf $f_X(x)$, and let $Y = X - \mu$ be an rv. with $E_Y[Y] = 0$ and pdf $f_Y(y) = f_X(y+\mu)$. Then

$$ \begin{align} h(Y) & = -\int_{-\infty}^\infty f_Y(y)\log f_Y(y) dy \\ & = -\int_{-\infty}^\infty f_X(y+\mu)\log f_X(y+\mu) dy \\ & = -\int_{-\infty}^\infty f_X(x)\log f_X(x) dx,~x = y + \mu \\ & = h(X) \end{align} $$

In other words, the entropies of the rvs. $X$ and $Y = X - \mu$ are always the same.

4.1.10: Run-length Coding

4.1.11: Arithmetic coding

4.1.12: Lempel-Ziv (LZW) coding

Lempel-Ziv (LZ) codes are widely used for universal lossless compression. Due to their flexible adaptation mechanisms, they are in principle applicable to compression of any kind of digital data.
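A minimal sketch of LZW-style dictionary encoding (assuming a byte/ASCII source; decoding is omitted):

def lzw_encode(data):
    dictionary = {chr(i): i for i in range(256)}  # start with all single characters
    w, out = "", []
    for c in data:
        wc = w + c
        if wc in dictionary:
            w = wc                            # keep extending the current match
        else:
            out.append(dictionary[w])         # emit the code of the longest match
            dictionary[wc] = len(dictionary)  # add the new phrase to the dictionary
            w = c
    if w:
        out.append(dictionary[w])
    return out

print(lzw_encode("abababababab"))  # 12 characters become 6 codes: [97, 98, 256, 258, 257, 260]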

Properties of different entropy-coding methods:

| Codewords | Source vectors, fixed-length | Source vectors, variable-length |
| --- | --- | --- |
| Fixed-length | Fixed-length coding (FLC), systematic VLC with 'escape' condition | Run-length coding with FLC, Lempel-Ziv coding |
| Variable-length | Shannon & Huffman coding, systematic VLC with 'typical' condition | Run-length coding with VLC, arithmetic coding |

4.1.13: Vector Quantization

4.2: Rate Distortion Theory

There is a fundamental tradeoff between rate and distortion. This is given by the rate-distortion function. We have previously discussed lossless coding/compression in some detail, and seen that there exists absolute bounds on the performance achievable for a given information source in terms of its entropy.

A natural question is if there exists similar results for lossy compression. The short answer is - yes, but we need some definitions before we can formalize the theory.

We define a distortion function or distortion measure as a mapping \begin{equation} d:{\cal X}\times{\cal \hat X} \rightarrow\mathbb R^{+} \end{equation} where ${\cal X}$ is the set of symbols to be coded (discrete or continuous), ${\cal \hat X}$ is the set representing compressed symbols, and $\mathbb R^{+}$ is the set of non-negative, real numbers. Examples of distortion functions are the Hamming, or probability of error, distortion

$$ \begin{equation} d(x,\hat x) = \left\{ \begin{array}{lc} 0 & x = \hat x \\ 1 & x \neq \hat x \end{array} \right. \end{equation} $$

for discrete binary symbols, and the squared error distortion

\begin{equation} d(x,\hat x) = \|x - \hat x\|^2 \end{equation}

for real numbers. We can define the distortion between sequences $x^n$ and $\hat x^n$ as

\begin{equation} d(x^n,\hat x^n) = \frac{1}{n}\sum_{i=1}^n d(x_i,\hat x_i) \end{equation}

We define the $(2^{nR},n)$-rate distortion code as an encoding function

$$ \begin{equation} f_n:{\cal X}^n\rightarrow \{1,2,3,\ldots,2^{nR}\} \end{equation} $$

and a decoding function

$$ \begin{equation} g_n:\{1,2,3,\ldots,2^{nR}\} \rightarrow {\cal \hat X}^n \end{equation} $$

We can then write the expected distortion associated with the $(2^{nR},n)$-rate distortion code as

$$ \begin{equation} D = E_X \left[ d(X^n,g_n(f_n(X^n))) \right] \end{equation} $$

We are now almost in a position to define the rate distortion function $R(D)$. We first need to define what is meant by an achievable rate distortion pair $(R,D)$, and a rate distortion region:

4.2.1: Rate-distortion pair

Definition: A rate distortion pair (R,D) is achievable if there exists a sequence of $(2^{nR},n)$-rate distortion codes $(f_n,g_n)$ so that

$$ \begin{equation} \lim_{n\rightarrow\infty} E_X \left[ d(X^n,g_n(f_n(X^n))) \right] \leq D \end{equation} $$

4.2.2: Rate-distortion region

Definition: The rate distortion region is the closure of the set of achievable rate distortion pairs $(R,D)$.

(Figure: the rate distortion region)

4.2.3: Rate distortion function

Definition: The rate distortion function is the infimum over all $R$ contained in the rate distortion region for a given distortion $D$.

This definition of the rate distortion function, though intuitive, is highly abstract and not constructive. We will now give a more concrete definition of the rate distortion function $R(D)$, but first we need to discuss the joint distribution of $X$ and $\hat X$, $f_{X,\hat X}(X,\hat X)$. Let us look at a simple uniform quantizer as shown below:

(Figure: a simple uniform quantizer)

We see that the mapping from $X$ to $\hat X = Q(X)$ is a simple deterministic mapping. In that sense $f_{\hat X|X}(\hat X|X)$ may not seem very interesting as it is one for $\hat X = Q(X)$ and zero otherwise. But just by that property it does actually specify the quantizer completely! In that sense, specifying different $f_{\hat X|X}(\hat X|X)$ can in principle determine underlying quantizers.

Definition: The rate distortion function $R(D)$ for an iid. discrete source $X$ and a bounded distortion measure $d(x,\hat x)$ is defined as $$ \begin{equation} R(D) = \min_{P(\hat X| X):\sum_{x,\hat x} P(x)P(\hat x|x) d(x,\hat x)\leq D} I(X;\hat X) \end{equation} $$

Definition: The rate distortion function $R(D)$ extended to continuous alphabets and a squared error distortion measure is defined as

$$ \begin{equation} R(D) = \min_{f_{\hat X|X}(\hat X| X): E \left[ \|X - \hat X \|^2 \right]\leq D} I(X;\hat X) \end{equation} $$

The rate distortion theorem states that $R(D)$ as defined above in fact is the rate distortion function as defined using the rate distortion region. The proof is outside of the scope of this course however.

4.2.4: Examples

4.2.4.1: Binary source

The rate distortion function for a binary source $X$ with $P(X=0) = p$ and Hamming distortion is

$$ \begin{equation} R(D) = \left\{ \begin{array}{lc} H(p) - H(D) & 0 \leq D \leq \min\{p,1-p\} \\ 0 & \text{otherwise} \end{array} \right. \end{equation} $$

Proof: We can easily prove this using a series of inequalities that we later show are tight for a concrete distribution $P(\hat X | X)$. Without loss of generality we make the assumption that $p\leq \frac{1}{2}$. Also, we define $X \oplus \hat X$ as addition modulo 2, which means that \begin{equation} X \oplus \hat X = 1 \Rightarrow X \neq \hat X \end{equation}

We wish to calculate

$$ \begin{align} R(D) = \min_{P(\hat X| X):\sum_{x,\hat x} P(x)P(\hat x|x) d(x,\hat x)\leq D} I(X;\hat X) \end{align} $$

We start by computing the mutual information when $P(\hat X|X)$ fulfills the constraint $$ \begin{align} D & \geq \sum_{x,\hat x} P(x)P(\hat x|x) d(x,\hat x) \\ & = \sum_{x,\hat x} P(x, \hat x) d(x,\hat x) \\ & = P(X \neq \hat X) \end{align} $$ The last result comes from the fact that $d(x,\hat x) = 1$ only when $X\neq \hat X$. Then we have $$ \begin{align} I(X;\hat X) & = H(X) - H(X|\hat X) \\ & = H(p) - H(X\oplus\hat X| \hat X) \\ & \geq H(p) - H(X\oplus\hat X) \\ & \geq H(p) - H(D) \end{align} $$ The first inequality comes from the property of the entropy that $H(Y) \geq H(Y|X)$. The second inequality comes from the constraint that $$ \begin{equation} D \geq P(X\neq \hat X) \end{equation} $$ which in turn implies that $H(X\oplus\hat X) \leq H(D)$ for $D\leq\frac{1}{2}$. We have thus proven that

$$ \begin{equation} R(D) \geq H(p) - H(D) \end{equation} $$

We now show that the inequality is tight if we choose the conditional distribution $P(X|\hat X)$ carefully. From the proof we see that if we choose $$ \begin{equation} H(X|\hat X) = H(D) \end{equation} $$

we get the equality $$ \begin{equation} R(D) = H(p) - H(D) \end{equation} $$ We can illustrate this by a figure. (Figure: achieving the rate distortion function)

Equality is achieved by choosing $P(\hat X = 0) = q$ so that $$ \begin{equation} q(1-D) + (1-q)D = 1 - p \end{equation} $$ holds. As seen in the figure, this is achieved when \begin{equation} P(\hat X = 0) = \frac{1 - p - D}{1 -2D} \end{equation}

import numpy as np
import matplotlib.pyplot as plt

def H(p):
    return -p*np.log2(p) - (1-p)*np.log2(1-p)

def R(D,p):
    R = H(p) - H(D)
    R[D > np.minimum(p, 1-p)] = 0  # R(D) is zero outside 0 <= D <= min(p, 1-p)
    return R

p = 1/2
D = np.linspace(1e-12,1-1e-12,100)

fig, ax = plt.subplots()
ax.plot(D,R(D,p),label="p={}".format(p))
ax.grid()
ax.set_xlabel("Distortion D")
ax.set_ylabel("Rate R")
ax.legend()

4.2.4.2: Continuous source

We now look at the rate distortion function corresponding to a Gaussian source \begin{equation} X \sim {\cal N}(x;0,\sigma^2) \end{equation}

We have already shown the rate distortion function for a continuous rv. with a squared error distortion measure to be $$ \begin{equation} R(D) = \min_{f(\hat X| X): E \left[ \|X - \hat X \|^2 \right]\leq D} I(X;\hat X) \end{equation} $$ Finding the rate distortion function for a Gaussian source is in principle done in the same way as for the binary source: first provide an inequality giving a lower bound for $R(D)$, then show that the bound can be achieved using a specific distribution.

We won't show how to compute $R(D)$ for a Gaussian source in detail, but will instead provide the expression here. For those interested, you can find a proof in chapter 10.3 of Elements of Information Theory by Cover and Thomas

$$ \begin{equation} R(D) = \left\{ \begin{array}{lc} \frac{1}{2}\log \frac{\sigma^2}{D} & 0 \leq D \leq \sigma^2 \\ 0 & \sigma^2 < D \end{array} \right. \end{equation} $$

We plot the function below

sigmasq = 1.0
D = np.linspace(1e-6, 2*sigmasq,10000)
R = np.maximum(0.5*np.log(sigmasq/D),0)

fig, ax = plt.subplots()
ax.plot(D,R)
ax.grid()
ax.set_xlabel("Distortion D")
ax.set_ylabel("Rate R")

Two observations:

  • The rate $R=0$ for distortion $D > \sigma^2$. This makes sense as just coding any sample from $X$ as zero will yield an average distortion of $D=\sigma^2$
  • The rate $R\uparrow\infty$ as $D\downarrow 0$. Again, this makes sense as we need an infinite number of bits to represent a real number.

Part 5: Machine learning

5.1: Neural Networks

5.1.1: Training

5.1.1.1: Batch

5.1.1.2: Mini-Batch

5.1.1.3: Stochastic gradient descent

5.1.1.4: Back propagation

5.1.2: Deep Neural Networks

5.1.3: Convolutional Neural Networks

5.1.4: Recurrent neural networks

5.2: Encoder-decoder networks

5.3: Attention

5.3.1: Transformer networks

5.4: Auto-encoders

Written by

hoppsann3 sebassr
Last updated: Wed, 25 May 2022 13:05:56 +0200 .