当前位置：网站首页>可爱的图像分类——Conv网络终于出了一口气：打爆了Swin的ConvNeXt

        当年Vision Transformer出现之时，号称打打爆了各大conv网络，实现了刷榜，后来出来的Swin Transformer更是犹有过之而无不及。
        我们不经在想，难道Conv网络的时代真的落幕了嘛？
        而且Transformer网络大多复杂难读，且计算量都非常的大，起码，我的电脑就跑不动Transformer网络，而相比之下Conv网络可爱多了。
现如今，Conv网络终于出了一口气：号称打爆各大Conv网络的Transformer网络反而被Conv网络技压一筹！
我什么这么说呢？我们先看几组数据对比：

这是ConvNeXt网络论文中展示的结果，我们可以明显看出，在ImageNet数据集中，ConvNeXt实现了刷榜，而且更难能可贵的是ConvNeXt网络结构非常简单！整体主干只需100行左右的代码！

ConvNeXt不仅仅是在ImageNet数据集中表现好，而且在相同的FLOPs下COCO等数据集准确率依旧高于Swin Transformer.
当然，Transformer网络巧妙的设计依旧是令人十分的赞叹的。

毫无新意的 ConvNext网络：

ConvNext未引进任何新的结构，仅仅在前人的结构之上，做出了非常精妙的修改，从而实现了效果比Swin Transformer更好的网络。

ConvNeXt网络的设计：

· Macro design

MacroDesign分为两部分：stage ratio 和 'patchlify stem',作者在这两方面对Macro Design进行了改进。

卷积结构：

stage ratio的改进：

stage ratio主要是 conv2.x : conv3.x : conv4.x : conv5.x

我们可以看出：ResNet50的比例是（3:4:6:3），ConvNeXt将比例变为了：（3:3:9:3）这样做有什么好处呢？
这样一改，准确率由78.8%提升到79.4%。

patchlify stem的改进：

在卷积神经网络中，我们将最初的下采样模块conv1我们称之为Stem.

我们可以看出：ResNet50的Stem是7×7的卷积以及步长为2的最大池化下采样组成，作者改为了：卷积核大小为4，步长为4的Stem。
这样一改：准确率提高到了79.5%。

· ResNeXt

ResNext由depth conv 和 width组成。

ResNeXt相比于普通的ResNet而言在FLOPs以及accuracy之间做到了更好的平衡。这里作者采用的是更激进的depthwise convolution。

· Inverted bottleneck

Inverted bottleneck主要由inverting dims组成。

作者认为Transformer block中的MLP模块非常像MobileNetV2中的Inverted Blottleneck模块，即两头细中间粗。

作者在加入了Inverted bottleneck之后，在较小的模型上准确率由80.5%提高到了80.6%,在较大的模型中准确率由81.9%提升到了82.6%。

· Large kerner size

Moving up depthwise conv layer, 将depthwise_conv模块上移动
原来是 1×1 conv --> depthwise_conv --> 1×1conv
现在变成 depthwise_conv --> 1×1 conv --> 1×1conv
Increasing the kernel size, 将depthwise conv的卷积核大小由3×3改成了7×7.

· Various layer-wise Micro designs

改进：

· Replacing ReLU with GELU （替换了激活函数）
· Fewer activation functions （减少了激活函数的使用）
· Fewer normalization layers （减少了Normalization标准化层）
· Substituting BN with LN （部分使用LN替换了BN）
· Separate downsampling layers （分离了下采用层）

ConvNeXt各种版本查看：

ConvNeXt-T: C=(96,192,384,768)， B=(3,3,9,3)
ConvNeXt-S: C=(96,192,384,768)， B=(3,3,27,3)
ConvNeXt-B: C=(128,256,512,1024)， B=(3,3,27,3)
ConvNeXt-L: C=(192,384,768,1536), B=(3,3,27,3)
ConvNeXt-XL: C=(256,512,1024,2048)， B=(3,3,27,3)

ConvNeXt主干网络代码查看：

pytorch版本：

"""
original code from facebook research:
https://github.com/facebookresearch/ConvNeXt
"""

import torch
import torch.nn as nn
import torch.nn.functional as F


def drop_path(x, drop_prob: float = 0., training: bool = False):
    """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).

    This is the same as the DropConnect impl I created for EfficientNet, etc networks, however,
    the original name is misleading as 'Drop Connect' is a different form of dropout in a separate paper...
    See discussion: https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ... I've opted for
    changing the layer and argument names to 'drop path' rather than mix DropConnect as a layer name and use
    'survival rate' as the argument.

    """
    if drop_prob == 0. or not training:
        return x
    keep_prob = 1 - drop_prob
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)  # work with diff dim tensors, not just 2D ConvNets
    random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
    random_tensor.floor_()  # binarize
    output = x.div(keep_prob) * random_tensor
    return output


class DropPath(nn.Module):
    """Drop paths (Stochastic Depth) per sample  (when applied in main path of residual blocks).
    """
    def __init__(self, drop_prob=None):
        super(DropPath, self).__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        return drop_path(x, self.drop_prob, self.training)


class LayerNorm(nn.Module):
    r""" LayerNorm that supports two data formats: channels_last (default) or channels_first.
    The ordering of the dimensions in the inputs. channels_last corresponds to inputs with
    shape (batch_size, height, width, channels) while channels_first corresponds to inputs
    with shape (batch_size, channels, height, width).
    """

    def __init__(self, normalized_shape, eps=1e-6, data_format="channels_last"):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(normalized_shape), requires_grad=True)
        self.bias = nn.Parameter(torch.zeros(normalized_shape), requires_grad=True)
        self.eps = eps
        self.data_format = data_format
        if self.data_format not in ["channels_last", "channels_first"]:
            raise ValueError(f"not support data format '{self.data_format}'")
        self.normalized_shape = (normalized_shape,)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.data_format == "channels_last":
            return F.layer_norm(x, self.normalized_shape, self.weight, self.bias, self.eps)
        elif self.data_format == "channels_first":
            # [batch_size, channels, height, width]
            mean = x.mean(1, keepdim=True)
            var = (x - mean).pow(2).mean(1, keepdim=True)
            x = (x - mean) / torch.sqrt(var + self.eps)
            x = self.weight[:, None, None] * x + self.bias[:, None, None]
            return x


class Block(nn.Module):
    r""" ConvNeXt Block. There are two equivalent implementations:
    (1) DwConv -> LayerNorm (channels_first) -> 1x1 Conv -> GELU -> 1x1 Conv; all in (N, C, H, W)
    (2) DwConv -> Permute to (N, H, W, C); LayerNorm (channels_last) -> Linear -> GELU -> Linear; Permute back
    We use (2) as we find it slightly faster in PyTorch

    Args:
        dim (int): Number of input channels.
        drop_rate (float): Stochastic depth rate. Default: 0.0
        layer_scale_init_value (float): Init value for Layer Scale. Default: 1e-6.
    """
    def __init__(self, dim, drop_rate=0., layer_scale_init_value=1e-6):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # depthwise conv
        self.norm = LayerNorm(dim, eps=1e-6, data_format="channels_last")
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # pointwise/1x1 convs, implemented with linear layers
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)
        self.gamma = nn.Parameter(layer_scale_init_value * torch.ones((dim,)),
                                  requires_grad=True) if layer_scale_init_value > 0 else None
        self.drop_path = DropPath(drop_rate) if drop_rate > 0. else nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)  # [N, C, H, W] -> [N, H, W, C]
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)
        if self.gamma is not None:
            x = self.gamma * x
        x = x.permute(0, 3, 1, 2)  # [N, H, W, C] -> [N, C, H, W]

        x = shortcut + self.drop_path(x)
        return x


class ConvNeXt(nn.Module):
    r""" ConvNeXt
        A PyTorch impl of : `A ConvNet for the 2020s`  -
          https://arxiv.org/pdf/2201.03545.pdf
    Args:
        in_chans (int): Number of input image channels. Default: 3
        num_classes (int): Number of classes for classification head. Default: 1000
        depths (tuple(int)): Number of blocks at each stage. Default: [3, 3, 9, 3]
        dims (int): Feature dimension at each stage. Default: [96, 192, 384, 768]
        drop_path_rate (float): Stochastic depth rate. Default: 0.
        layer_scale_init_value (float): Init value for Layer Scale. Default: 1e-6.
        head_init_scale (float): Init scaling value for classifier weights and biases. Default: 1.
    """
    def __init__(self, in_chans: int = 3, num_classes: int = 1000, depths: list = None,
                 dims: list = None, drop_path_rate: float = 0., layer_scale_init_value: float = 1e-6,
                 head_init_scale: float = 1.):
        super().__init__()
        self.downsample_layers = nn.ModuleList()  # stem and 3 intermediate downsampling conv layers
        stem = nn.Sequential(nn.Conv2d(in_chans, dims[0], kernel_size=4, stride=4),
                             LayerNorm(dims[0], eps=1e-6, data_format="channels_first"))
        self.downsample_layers.append(stem)

        # 对应stage2-stage4前的3个downsample
        for i in range(3):
            downsample_layer = nn.Sequential(LayerNorm(dims[i], eps=1e-6, data_format="channels_first"),
                                             nn.Conv2d(dims[i], dims[i+1], kernel_size=2, stride=2))
            self.downsample_layers.append(downsample_layer)

        self.stages = nn.ModuleList()  # 4 feature resolution stages, each consisting of multiple blocks
        dp_rates = [x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))]
        cur = 0
        # 构建每个stage中堆叠的block
        for i in range(4):
            stage = nn.Sequential(
                *[Block(dim=dims[i], drop_rate=dp_rates[cur + j], layer_scale_init_value=layer_scale_init_value)
                  for j in range(depths[i])]
            )
            self.stages.append(stage)
            cur += depths[i]

        self.norm = nn.LayerNorm(dims[-1], eps=1e-6)  # final norm layer
        self.head = nn.Linear(dims[-1], num_classes)
        self.apply(self._init_weights)
        self.head.weight.data.mul_(head_init_scale)
        self.head.bias.data.mul_(head_init_scale)

    def _init_weights(self, m):
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.trunc_normal_(m.weight, std=0.2)
            nn.init.constant_(m.bias, 0)

    def forward_features(self, x: torch.Tensor) -> torch.Tensor:
        for i in range(4):
            x = self.downsample_layers[i](x)
            x = self.stages[i](x)

        return self.norm(x.mean([-2, -1]))  # global average pooling, (N, C, H, W) -> (N, C)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.forward_features(x)
        x = self.head(x)
        return x


def convnext_tiny(num_classes: int):
    # https://dl.fbaipublicfiles.com/convnext/convnext_tiny_1k_224_ema.pth
    model = ConvNeXt(depths=[3, 3, 9, 3],
                     dims=[96, 192, 384, 768],
                     num_classes=num_classes)
    return model


def convnext_small(num_classes: int):
    # https://dl.fbaipublicfiles.com/convnext/convnext_small_1k_224_ema.pth
    model = ConvNeXt(depths=[3, 3, 27, 3],
                     dims=[96, 192, 384, 768],
                     num_classes=num_classes)
    return model


def convnext_base(num_classes: int):
    # https://dl.fbaipublicfiles.com/convnext/convnext_base_1k_224_ema.pth
    # https://dl.fbaipublicfiles.com/convnext/convnext_base_22k_224.pth
    model = ConvNeXt(depths=[3, 3, 27, 3],
                     dims=[128, 256, 512, 1024],
                     num_classes=num_classes)
    return model


def convnext_large(num_classes: int):
    # https://dl.fbaipublicfiles.com/convnext/convnext_large_1k_224_ema.pth
    # https://dl.fbaipublicfiles.com/convnext/convnext_large_22k_224.pth
    model = ConvNeXt(depths=[3, 3, 27, 3],
                     dims=[192, 384, 768, 1536],
                     num_classes=num_classes)
    return model


def convnext_xlarge(num_classes: int):
    # https://dl.fbaipublicfiles.com/convnext/convnext_xlarge_22k_224.pth
    model = ConvNeXt(depths=[3, 3, 27, 3],
                     dims=[256, 512, 1024, 2048],
                     num_classes=num_classes)
    return model

TensorFlow版本：

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, initializers, Model

KERNEL_INITIALIZER = {
    "class_name": "TruncatedNormal",
    "config": {
        "stddev": 0.2
    }
}

BIAS_INITIALIZER = "Zeros"


class Block(layers.Layer):
    """
    Args:
        dim (int): Number of input channels.
        drop_rate (float): Stochastic depth rate. Default: 0.0
        layer_scale_init_value (float): Init value for Layer Scale. Default: 1e-6.
    """
    def __init__(self, dim, drop_rate=0., layer_scale_init_value=1e-6, name: str = None):
        super().__init__(name=name)
        self.layer_scale_init_value = layer_scale_init_value
        self.dwconv = layers.DepthwiseConv2D(7,
                                             padding="same",
                                             depthwise_initializer=KERNEL_INITIALIZER,
                                             bias_initializer=BIAS_INITIALIZER,
                                             name="dwconv")
        self.norm = layers.LayerNormalization(epsilon=1e-6, name="norm")
        self.pwconv1 = layers.Dense(4 * dim,
                                    kernel_initializer=KERNEL_INITIALIZER,
                                    bias_initializer=BIAS_INITIALIZER,
                                    name="pwconv1")
        self.act = layers.Activation("gelu")
        self.pwconv2 = layers.Dense(dim,
                                    kernel_initializer=KERNEL_INITIALIZER,
                                    bias_initializer=BIAS_INITIALIZER,
                                    name="pwconv2")
        self.drop_path = layers.Dropout(drop_rate, noise_shape=(None, 1, 1, 1)) if drop_rate > 0 else None

    def build(self, input_shape):
        if self.layer_scale_init_value > 0:
            self.gamma = self.add_weight(shape=[input_shape[-1]],
                                         initializer=initializers.Constant(self.layer_scale_init_value),
                                         trainable=True,
                                         dtype=tf.float32,
                                         name="gamma")
        else:
            self.gamma = None

    def call(self, x, training=False):
        shortcut = x
        x = self.dwconv(x)
        x = self.norm(x, training=training)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)

        if self.gamma is not None:
            x = self.gamma * x

        if self.drop_path is not None:
            x = self.drop_path(x, training=training)

        return shortcut + x


class Stem(layers.Layer):
    def __init__(self, dim, name: str = None):
        super().__init__(name=name)
        self.conv = layers.Conv2D(dim,
                                  kernel_size=4,
                                  strides=4,
                                  padding="same",
                                  kernel_initializer=KERNEL_INITIALIZER,
                                  bias_initializer=BIAS_INITIALIZER,
                                  name="conv2d")
        self.norm = layers.LayerNormalization(epsilon=1e-6, name="norm")

    def call(self, x, training=False):
        x = self.conv(x)
        x = self.norm(x, training=training)
        return x


class DownSample(layers.Layer):
    def __init__(self, dim, name: str = None):
        super().__init__(name=name)
        self.norm = layers.LayerNormalization(epsilon=1e-6, name="norm")
        self.conv = layers.Conv2D(dim,
                                  kernel_size=2,
                                  strides=2,
                                  padding="same",
                                  kernel_initializer=KERNEL_INITIALIZER,
                                  bias_initializer=BIAS_INITIALIZER,
                                  name="conv2d")

    def call(self, x, training=False):
        x = self.norm(x, training=training)
        x = self.conv(x)
        return x


class ConvNeXt(Model):
    r""" ConvNeXt
        A Tensorflow impl of : `A ConvNet for the 2020s`  -
          https://arxiv.org/pdf/2201.03545.pdf
    Args:
        num_classes (int): Number of classes for classification head. Default: 1000
        depths (tuple(int)): Number of blocks at each stage. Default: [3, 3, 9, 3]
        dims (int): Feature dimension at each stage. Default: [96, 192, 384, 768]
        drop_path_rate (float): Stochastic depth rate. Default: 0.
        layer_scale_init_value (float): Init value for Layer Scale. Default: 1e-6.
    """
    def __init__(self, num_classes: int, depths: list, dims: list, drop_path_rate: float = 0.,
                 layer_scale_init_value: float = 1e-6):
        super().__init__()
        self.stem = Stem(dims[0], name="stem")

        cur = 0
        dp_rates = np.linspace(start=0, stop=drop_path_rate, num=sum(depths))
        self.stage1 = [Block(dim=dims[0],
                             drop_rate=dp_rates[cur + i],
                             layer_scale_init_value=layer_scale_init_value,
                             name=f"stage1_block{i}")
                       for i in range(depths[0])]
        cur += depths[0]

        self.downsample2 = DownSample(dims[1], name="downsample2")
        self.stage2 = [Block(dim=dims[1],
                             drop_rate=dp_rates[cur + i],
                             layer_scale_init_value=layer_scale_init_value,
                             name=f"stage2_block{i}")
                       for i in range(depths[1])]
        cur += depths[1]

        self.downsample3 = DownSample(dims[2], name="downsample3")
        self.stage3 = [Block(dim=dims[2],
                             drop_rate=dp_rates[cur + i],
                             layer_scale_init_value=layer_scale_init_value,
                             name=f"stage3_block{i}")
                       for i in range(depths[2])]
        cur += depths[2]

        self.downsample4 = DownSample(dims[3], name="downsample4")
        self.stage4 = [Block(dim=dims[3],
                             drop_rate=dp_rates[cur + i],
                             layer_scale_init_value=layer_scale_init_value,
                             name=f"stage4_block{i}")
                       for i in range(depths[3])]

        self.norm = layers.LayerNormalization(epsilon=1e-6, name="norm")
        self.head = layers.Dense(units=num_classes,
                                 kernel_initializer=KERNEL_INITIALIZER,
                                 bias_initializer=BIAS_INITIALIZER,
                                 name="head")

    def call(self, x, training=False):
        x = self.stem(x, training=training)
        for block in self.stage1:
            x = block(x, training=training)

        x = self.downsample2(x, training=training)
        for block in self.stage2:
            x = block(x, training=training)

        x = self.downsample3(x, training=training)
        for block in self.stage3:
            x = block(x, training=training)

        x = self.downsample4(x, training=training)
        for block in self.stage4:
            x = block(x, training=training)

        x = tf.reduce_mean(x, axis=[1, 2])
        x = self.norm(x, training=training)
        x = self.head(x)
        return x


def convnext_tiny(num_classes: int):
    model = ConvNeXt(depths=[3, 3, 9, 3],
                     dims=[96, 192, 384, 768],
                     num_classes=num_classes)
    return model


def convnext_small(num_classes: int):
    model = ConvNeXt(depths=[3, 3, 27, 3],
                     dims=[96, 192, 384, 768],
                     num_classes=num_classes)
    return model


def convnext_base(num_classes: int):
    model = ConvNeXt(depths=[3, 3, 27, 3],
                     dims=[128, 256, 512, 1024],
                     num_classes=num_classes)
    return model


def convnext_large(num_classes: int):
    model = ConvNeXt(depths=[3, 3, 27, 3],
                     dims=[192, 384, 768, 1536],
                     num_classes=num_classes)
    return model


def convnext_xlarge(num_classes: int):
    model = ConvNeXt(depths=[3, 3, 27, 3],
                     dims=[256, 512, 1024, 2048],
                     num_classes=num_classes)
    return model

原网站

版权声明
本文为[舞雩.]所创，转载请带上原文链接，感谢
https://blog.csdn.net/qq_51831335/article/details/125808121