Trying to understand alias-free-gan.

Overview

alias-free-gan-explanation

Trying to understand alias-free-gan in my own way.

[Chinese Version 中文版本]

CC-BY-4.0 License. Tzu-Heng Lin

motivation of this article: So, the thing is, I've been reading the paper for several days, and the paper is written in a way that I find really hard to understand. Thus, I decide to rephrase the main idea of the paper in my language. Some explanation might be different from the original paper. Of course, I might be making mistakes, so please feel free to correct me.

disclaimer: This paragraph is only my personal understanding. One is suggested to read the original paper. Details related to implemtation are not discussed here.

Karras, Tero, et al. Alias-Free Generative Adversarial Networks. arXiv preprint arXiv:2106.12423 (2021).

[Original Paper] [Code]

Overall Logic:

  • Modeling
    • Signals flow through the network are interpreted as continuous signals. The actually used feature maps are just discrete samples of them.
  • Problem Identifying
    • Discovering that current network architectures do not have a clear mechanism to restrict the generator to synthesis images in a strict hierarchical manner. Moreover, due to the fact that the frequencies of feature maps do not meet the condition of Nyquisit-Shannon Sampling Theorem, aliasing happens.
  • Problem Solving
    • Redesign a network that is alias-free and strictly follows the hierarchical synthesizing manner.
  • Analysis
    • We can show that alias-free generators are translation or rotation equivariant.
    • We can show that equivariant generators do not encounter the "texture sticking" phenomenon.

TOC

1. Motivation

1.1 Continuous and Discrete Signals

image-20210916233015961
Firstly, we need to interpret the information flow through the network in a more appropriate way. (with signal processing).
  • The authors utilize the concepts in signal processing, and interpret the information flow through the network as spatially infinite continuous signals. The feature maps we actually used are just discrete samples of the continuous signals in a targeted canvas. They can be seen as a convenient encoding of the continuous singals. If we set the unit square [0, 1] in the continuous singals as our targeted canvas, the size of the feature maps can then represent the sampling rate when converting continuous signals to discrete ones.
  • The high/low frequencies we are talking about are those frequencies we obtained in the frequency domain after we apply fourier transform to the continuous singals.
  • Since the procedure is sampling, the conditions of Nyquist-Shannon Sampling Theorem need to be satisfied. That is to say, the highest frequencies of the continuous signals must be smaller than half of the sampling rate (this is often called the Nyquist Frequency), or else the problem of aliasing would happen. (See Figure below.)

1.2 Problems of Exisiting Architecture

Ideal way for GANs to synthesize information:

  • Hierarchical Manner: From shallow to deep layers, synthesizing features from coarse to fine, from low to high frequencies. (For example, synthesizing a face would follow orders like: overall contour of the face -> ... -> skin -> pores, beard, other textures on skin)

Problems for existing GANs:

  • We find that existing GAN network architectures do not have a mechanism to restrict the generator to synthesis images in a strict hierarchical manner. Although they limit the resolution of feature maps in each layers to let feature maps in shallow layers cannot represent high frequency signals, but the new frequencies generated by operations in each layer, cannot be guarenteed to be smaller than the corresponding Nyquisit Frequency. If the above condition does not meet, the problem of aliasing would happen, which would make the high frequencies to be represented as low frequencies in the frequency domain, contaminating the whole signal.

1.3 Main Contribution

We want to design a network architecture, that strictly follows the ideal hierarchical manner of synthesizing information. Every layer is restricted to only synthesizing frequencies in the range that we designated to them, and thus, removing the problem of aliasing. (That's why the paper is called Alias Free GAN, IMO).

2. Method

2.1 Basic Op Redesign

Existing GANs contain basic Operations like: Conv, Upsamling, Downsampling, Nonlinearity. In the following, we will analyze them respectively, to see if they have the problem of aliasing. And if so, how do we fix them.

  • Conv

    • Convolution, it is used to locally reorganize signals, producing signals that meet our expectations more.
    • Convolution itself does not introduce new frequencies. (Convolution in time domain is equivalent to multiplication in the frequency domain. So where originally 0 is still 0 in the frequency domain).
  • Downsampling (See Figure Below)

    • Resample a signal to a lower sampling rate (s -> s', where s>s'). It is used to let the viable area smaller in the spectrum.

    • Notice that the sampling rate afterwards could be smaller than twice of the highest frequencies of the original signal. Thus, we need to use a Low Pass Filter beforehand to restrict the frequencies of the original signal to be less than half of the lowered sampling rate, then can we do the downsampling procedure (dropping points).

      upsample
  • Upsampling (See FIgure Below)

    • Resample a signal to a higher sampling rate (s -> s', where s<s'). It is use to add headroom in the spectrum, to let the viable area larger (So that subsequent layers can introduce new frequencies). Note that itself does not introduce new frequencies.

    • The procedure is achieved by first interleaving the original signals with 0, then use a Low Pass Filter to remove imaging in the frequency domain. Note that, the LPF used here is using cutoff=s/2, sampling rate=s'.

    • The upsampling and downsampling procedures introduced above might seem a little confusing for one who haven't learnt signal processing lessons before. However, they are actually the widely used procedures in the field of signal processing to resample signals. And they are very intuitive when explaining them with the Figure above.

      upsample
  • Nonlinearity (See Video)

    • Elementwisely nonlinearity (e.g. ReLU). It is used to introduce new frequencies.
    • The new frequencies introduced by nonlinearity contains two parts: the 1st part that meets the condition of the sampling theorem, and the 2nd part that doesn't. We want to preserve the former and eliminate the latter. However, if we directly apply nonlinearity to the discrete feature map, the newly introduced 2nd part frequencies will directly create aliasing.
    • Thus, the authors propose a very interesting method: Firstly, you upsample the signal by m (usually set to 2), then you apply the nonlinearity, and finally you downsample the signal back. The first upsampling is to increase the Nyquisit Frequency, adding headroom for the 2nd part frequencies newly introduced to avoid aliasing. Then, the downsampling procedure (including a LPF to eliminate the 2nd part frequencies) convert the signal back to its original sampling rate.
  • Low Pass Filter

    • Notice that downsampling, upsamling, nonlineaity operation introduced above use LPF.
    • The authors use a Kaiser-Windowed Sinc Filter (a FIR LPF) because it can directly manipulate transition band and attenuation.
    • Two very good links on LPF and Kaier window: link1, link2.

2.2 Equivariant and Texture Sticking

Equivariant means that when the input translate, the output translate equivalently. We can define to kinds of equivariant: Translation Equivariant, and Roation Equivariant.

Translation Equivariant

  • We can show that the alias-free network is translation equivariant naturally.

    • According to the above theoretical analysis, if we treat the signal as infinite continuous signal in the time domain throughout the network, the shift of the signal in the time domain does not actually change the amplitude of the signal in the frequency domain. Therefore, no matter how you move the input signal up, down, left, and right in the time domain, the output of each layer of the network will move along with it, and the final output signal will definitely move along with it.
  • The authors define a metric to evaluate the translation equivariance: EQ-T. Basically, it calculates the difference between two sets of images: translating the input or output of the syntheis network by the same random amount.

    image-20210917130927681

Rotation Equivariant

  • For rotation equivariance, we need some modification to Conv and LPF

    • Conv: We need keernel to be radially symmetry in the time domain. This is easy to understand. If you rotate the input signal, the most intuitive and simple way is to perform the same rotation for Conv kernels. In this way, there is no relative movement between the two, which is equivalent to the original operation.
    • Low Pass Filter: We also need keernel to be radially symmetry in the time domain. The explanation is similar to Conv.
  • The authors define a metric to evaluate the rotation equivariance: EQ-R.

    image-20210917131037667

Texture Sticking (video)

image-20210916002435902
  • We can show that equivariant networks do not have such phenomenon. The manifestation of this phenomenon is that high and low frequency features will not be transformed at the same speed together. But if the network has equivariance, then all features must be transformed together at the same speed, and this phenomenon will naturally not occur.

2.3 Detailed Design of Overall Network Architecture

image-20210917020314793
Apart from the changes of the basic operations, there are other changes in the network architectures.
  • (config B,H) Fourier Features

    • (B) Change original 'learned constant input' to 'Fourier Features'.

      • According to the previous analysis, the input that we essentially deal with is an infinite continuous signal, so the authors use Fourier Features here, which naturally have spatially infinite characteristics. The discrete input signal can be sampled from the continuous expression. At the same time, because there is an actual continuous expression, we can also easily translate and rotate the signal, then sample it and input it into the network, so that we can calculate EQ-T and EQ-R conveniently.

      • What exactly does the Fourier Feature look like? The authors' official implementation in unknown yet. According to rosinality/alias-free-gan-pytorch, it uses each piece of feature map to represent some frequency of sin or cos signal on x or y direction (which makes it 4 feature maps for each frequency). Code is implemented here: plot_fourier_features.py.

    • (H) Transformed Fourier Features (Appendix F)

      • The above Fourier features are randomly rotated or translated in the time domain (that is, the style of w also controls the input signal), and then being fed into the network. w -> t = (rc, rs, tx, ty), t = t/sqrt(rc^2+rs^2). code is implemented here: plot_fourier_features.py
  • (config E) 10px margin expanded to the original feature maps

    • In the above theoretical assumptions, the signals are spatially infinite, and the Conv, Upsampling, and Downsampling calculations at the edge will also use the values outside the boundary of the targeted canvas, so here we can use the following approach to approximate the infinite feature map :

      • Expand the feature map by a 10px margin.

      • If the feature map is upsampled, the margin is also upsampled, so we need to crop the margin after upsampling to make it remain to 10px.

      • If there is no upsampling, then no extra care is needed.

        image-20210919175525459
  • (config E,G,T) Sampling rate and LPF design

    • (E) According to the above analysis, a very intuitive approach (critical sampling) is to set the cutoff fc of the low-pass filter to half of the sampling rate s/2, and set half of transition band fh to (\sqrt{2}-1) (s/2) .

    • (G) However, doing so is actually dangerous, because our low-pass filter is just an approximation, it is not an ideal rectangular window in the frequency domain, so there will be some missing frequencies that can still pass through around the critical point. So here, the authors set cutoff fc to s/2-fh. The intuitive understanding is to keep less and filter out more. It is safer to avoid aliasing. Except for the last few layers, cutoff is still set to s/2, because the last layers really needs more high-frequency features.

    • (T) The authors found that the attenuation of the aforementioned low-pass filter is still insufficient for the low-resolution layers. The original design philosophy have fixed rules for each layer. The authors propose to design each layer separately here. They hope to have as large attenuation as possible in the low resolution layers, and keep more high frequency features in the high resolution layers.

      • The right most figure below shows a N=14 Generator design. The last two layer is critical sampled.
      • The cutoff fc (blue line) grows geometrically from fc= 2 in the first layer to fc= sN/2 in the first critically sampled layer.
      • The minimum acceptable stopband freq ft (yellow line) starts at f_{t,0} = 2^2.1 , and grows geometrically but slower than the cutoff fc. For the last two layers, ft = fc * 2^0.3.
        • f_{t,0} provides an effective way to trade training speed for equivariance quality.
      • The sampling rate s is set to double of the smallest multiple of two which is larger than ft. (but not exceeding the final output resolution).
      • Half of the transition band fh = max(s/2, ft) -fc
      • Now the number of layers N is not completely dependent on the output resolution. The authors then set the number of layers for all resolutions to 14.
      image-20210917215737862 image-20210917214706969 image-20210917012305268
  • (config R) Rotation Equivariance. As stated above, we need to change Conv and LPF to radially symmetry kernels.

    • Conv: replace all 3x3 conv with 1x1.
    • LPF: use jinc filter with the same Kaiser Window: image-20210917222512704
  • (config C, D) Others

    • (C) removing per-pixel noise. Since the spectrum of gaussian noise has the same intensity on all frequency, obviously it does not meet the sampling theorem.
    • (D) simplify generator. including:
      • mapping network 8->2
      • eliminate mixing regularization
      • eliminate path length regularization
      • eliminate skip connection, change to normalization using EMA of sigma

3. Experiments

3.1 Dataset

  • FFHQ-U and MetFaces-U: unaligned version of FFHQ and MetFaces. Difference with the original version: Axis-aligned crop, preserving orginal image angel, random crop face region, no mirrored.
  • AFHQv2: The original AFHQ use inappropriate downsampling, which results in aliasing. The new version use PIL's Lanczos.
  • Beaches: 20,155 photos, 512x512

3.2 Quantitative and Qualitative Results

image-20210917205044756
  • FFHQ (1024×1024)
    • # of params of the three Generator are: 30.0M, 22.3M, 15.8M
    • Training time (GPU hour): 1106, 1576 (+42%), 2248 (+103%)
  • Equivariance (video, video)
  • Texture Sticking phenomenon disappear (video, video)

3.3 Ablation Study

image-20210917205203726 image-20210917205257000
  • mixing reg. does no harm, but is somewhat useless(Appendix A)
  • per-pixel noise compromises equivariances significantly.
  • Fixed Fourier Features harms FID.
  • path length reg. harms FID, but improves equivariance (strange behavior). (Path length regularization is in principle at odds with translation equivariance, as it penalizes image changes upon latent space walk and thus encourages texture sticking. We suspect that the counterintuitive improvement in equivariance may come from slightly blurrier generated images, at a cost of poor FID.)
  • Capacity: halving the number of feature maps harms FID but the network remains equivariant. Doubling the number improves FID, yet with 4x training time.
  • DIfferent window function for sinc/jinc filter: Kaier, Lanczos, Gaussian. Lanczos is best on FID yet compromises equivariance. Gaussian leads to clear worse FID.
  • p4 symmetry G-CNN is not even close compared to Alias-Free-R on rotation equivariance.

3.4 Feature Map Visualization

video

image-20210917204255781
Owner
Tzu-Heng Lin
DL, CV, GAN, RS, DM (see https://lzhbrian.me)
Tzu-Heng Lin
This is the code repository for the paper A hierarchical semantic segmentation framework for computer-vision-based bridge column damage detection

Bridge-damage-segmentation This is the code repository for the paper A hierarchical semantic segmentation framework for computer-vision-based bridge c

Jingxiao Liu 5 Dec 07, 2022
A PyTorch implementation of SlowFast based on ICCV 2019 paper "SlowFast Networks for Video Recognition"

SlowFast A PyTorch implementation of SlowFast based on ICCV 2019 paper SlowFast Networks for Video Recognition. Requirements Anaconda PyTorch conda in

Hao Ren 8 Dec 23, 2022
Official implementation of Densely connected normalizing flows

Densely connected normalizing flows This repository is the official implementation of NeurIPS 2021 paper Densely connected normalizing flows. Poster a

Matej Grcić 31 Dec 12, 2022
A Blender python script for getting asset browser custom preview images for objects and collections.

asset_snapshot A Blender python script for getting asset browser custom preview images for objects and collections. Installation: Click the code butto

Johnny Matthews 44 Nov 29, 2022
[CVPR'21] FedDG: Federated Domain Generalization on Medical Image Segmentation via Episodic Learning in Continuous Frequency Space

FedDG: Federated Domain Generalization on Medical Image Segmentation via Episodic Learning in Continuous Frequency Space by Quande Liu, Cheng Chen, Ji

Quande Liu 178 Jan 06, 2023
Dirty Pixels: Towards End-to-End Image Processing and Perception

Dirty Pixels: Towards End-to-End Image Processing and Perception This repository contains the code for the paper Dirty Pixels: Towards End-to-End Imag

50 Nov 18, 2022
Official codebase for running the small, filtered-data GLIDE model from GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models.

GLIDE This is the official codebase for running the small, filtered-data GLIDE model from GLIDE: Towards Photorealistic Image Generation and Editing w

OpenAI 2.9k Jan 04, 2023
An image processing project uses Viola-jones technique to detect faces and then use SIFT algorithm for recognition.

Attendance_System An image processing project uses Viola-jones technique to detect faces and then use LPB algorithm for recognition. Face Detection Us

8 Jan 11, 2022
LaneDet is an open source lane detection toolbox based on PyTorch that aims to pull together a wide variety of state-of-the-art lane detection models

LaneDet is an open source lane detection toolbox based on PyTorch that aims to pull together a wide variety of state-of-the-art lane detection models. Developers can reproduce these SOTA methods and

TuZheng 405 Jan 04, 2023
This is the replication package for paper submission: Towards Training Reproducible Deep Learning Models.

This is the replication package for paper submission: Towards Training Reproducible Deep Learning Models.

0 Feb 02, 2022
A Fast and Stable GAN for Small and High Resolution Imagesets - pytorch

A Fast and Stable GAN for Small and High Resolution Imagesets - pytorch The official pytorch implementation of the paper "Towards Faster and Stabilize

Bingchen Liu 455 Jan 08, 2023
Lightweight Cuda Renderer with Python Wrapper.

pyRender Lightweight Cuda Renderer with Python Wrapper. Compile Change compile.sh line 5 to the glm library include path. This library can be download

Jingwei Huang 53 Dec 02, 2022
Official Code Release for "CLIP-Adapter: Better Vision-Language Models with Feature Adapters"

Official Code Release for "CLIP-Adapter: Better Vision-Language Models with Feature Adapters" Pipeline of CLIP-Adapter CLIP-Adapter is a drop-in modul

peng gao 157 Dec 26, 2022
DAN: Unfolding the Alternating Optimization for Blind Super Resolution

DAN-Basd-on-Openmmlab DAN: Unfolding the Alternating Optimization for Blind Super Resolution We reproduce DAN via mmediting based on open-sourced code

AlexZou 72 Dec 13, 2022
CLIPImageClassifier wraps clip image model from transformers

CLIPImageClassifier CLIPImageClassifier wraps clip image model from transformers. CLIPImageClassifier is initialized with the argument classes, these

Jina AI 6 Sep 12, 2022
Source code for "Interactive All-Hex Meshing via Cuboid Decomposition [SIGGRAPH Asia 2021]".

Interactive All-Hex Meshing via Cuboid Decomposition Video demonstration This repository contains an interactive software to the PolyCube-based hex-me

Lingxiao Li 131 Dec 05, 2022
This repository provides an efficient PyTorch-based library for training deep models.

s3sec Test AWS S3 buckets for read/write/delete access This tool was developed to quickly test a list of s3 buckets for public read, write and delete

Bytedance Inc. 123 Jan 05, 2023
Official Pytorch implementation for 2021 ICCV paper "Learning Motion Priors for 4D Human Body Capture in 3D Scenes" and trained models / data

Learning Motion Priors for 4D Human Body Capture in 3D Scenes (LEMO) Official Pytorch implementation for 2021 ICCV (oral) paper "Learning Motion Prior

165 Dec 19, 2022
[3DV 2021] Channel-Wise Attention-Based Network for Self-Supervised Monocular Depth Estimation

Channel-Wise Attention-Based Network for Self-Supervised Monocular Depth Estimation This is the official implementation for the method described in Ch

Jiaxing Yan 27 Dec 30, 2022
[ICLR'21] Counterfactual Generative Networks

This repository contains the code for the ICLR 2021 paper "Counterfactual Generative Networks" by Axel Sauer and Andreas Geiger. If you want to take the CGN for a spin and generate counterfactual ima

88 Jan 02, 2023