This article covers the design of feature extraction blocks developed to improve multi-scale object detection while maintaining fast inference in real-world applications.
Cross-stage partial connections (CSP)
Wang et al. [1] introduced cross-stage partial (CSP) connections to address redundant gradient information in large convolutional neural network backbones. The main objective is to enrich gradient combinations while reducing computational cost. CSP preserves gradient diversity by combining feature maps from the beginning and the end of each network stage. The base layer’s feature maps are split into two parts: one passes through a dense block and a transition layer, while the other bypasses this path and is merged with the first part’s output at the end of the stage. The design targets three issues: strengthening the learning capability of the CNN, removing computational bottlenecks, and lowering memory cost.
CSP-DenseNet keeps DenseNet’s feature reuse benefits while limiting duplicate gradient information by truncating the gradient flow, which it achieves through a hierarchical feature fusion strategy in a partial transition layer.
According to the authors’ experiments, this approach reduces computation by 20% while achieving equivalent or even superior accuracy on the ImageNet dataset.
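To see why routing only part of the feature map through the heavy path saves computation, the following rough sketch counts multiply-accumulate operations (MACs) for a stage with and without the split. It is a simplification under stated assumptions: plain 3×3 convolutions stand in for the dense block, the split is an even half, and the helper names (`conv_macs`, `full_stage_macs`, `csp_stage_macs`) are hypothetical. It is not meant to reproduce the 20% figure reported in [1].

```python
def conv_macs(c_in: int, c_out: int, k: int, h: int, w: int) -> int:
    """MACs of a k x k convolution (stride 1, same padding, no bias)."""
    return c_in * c_out * k * k * h * w


def full_stage_macs(c: int, n: int, h: int, w: int) -> int:
    """n stacked 3x3 convolutions applied to all c channels."""
    return n * conv_macs(c, c, 3, h, w)


def csp_stage_macs(c: int, n: int, h: int, w: int) -> int:
    """The same n convolutions applied to half of the channels only.

    The other half bypasses them; a 1x1 convolution then fuses the
    concatenated result back to c channels.
    """
    half = c // 2
    heavy_path = n * conv_macs(half, half, 3, h, w)
    fusion = conv_macs(c, c, 1, h, w)
    return heavy_path + fusion


if __name__ == "__main__":
    full = full_stage_macs(c=64, n=4, h=32, w=32)
    csp = csp_stage_macs(c=64, n=4, h=32, w=32)
    print(f"full stage: {full:,} MACs")
    print(f"CSP stage:  {csp:,} MACs ({csp / full:.0%} of full)")
```

In this toy setting the split stage needs roughly 28% of the MACs of the unsplit one, because halving both input and output channels quarters the cost of each 3×3 convolution and the 1×1 fusion is comparatively cheap.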
C3
YOLOv4 and YOLOv5 use Cross Stage Partial (CSP) modules to improve feature extraction in the backbone. The C3 block is a practical implementation of this CSP architecture in Ultralytics YOLO models.
In the C3 block, the input feature map feeds two branches. One branch applies a 1×1 convolution (cv1) followed by n Bottleneck blocks stacked in sequence, while the other applies a separate 1×1 convolution (cv2) and skips the Bottlenecks entirely. Each 1×1 convolution projects the full input down to c_ = int(c2 · e) hidden channels rather than slicing the input tensor. The two branches are then concatenated along the channel dimension and fused by a third 1×1 convolution (cv3) to produce the output.
               Input (x)
                   │
          ┌────────┴─────────┐
          │                  │
     [1x1 Conv]         [1x1 Conv]
        (cv1)              (cv2)
          │                  │
    [Bottlenecks]            │
    (m: n blocks)            │
          │                  │
          └────────┬─────────┘
                   │
           [Concat along C]
                   │
           [1x1 Conv → cv3]
                   │
                Output
Ultralytics implementation (github link):
    class C3(nn.Module):
        """CSP Bottleneck with 3 convolutions."""

        def __init__(self, c1: int, c2: int, n: int = 1, shortcut: bool = True, g: int = 1, e: float = 0.5):
            """
            Initialize the CSP Bottleneck with 3 convolutions.

            Args:
                c1 (int): Input channels.
                c2 (int): Output channels.
                n (int): Number of Bottleneck blocks.
                shortcut (bool): Whether to use shortcut connections.
                g (int): Groups for convolutions.
                e (float): Expansion ratio.
            """
            super().__init__()
            c_ = int(c2 * e)  # hidden channels
            self.cv1 = Conv(c1, c_, 1, 1)
            self.cv2 = Conv(c1, c_, 1, 1)
            self.cv3 = Conv(2 * c_, c2, 1)
            self.m = nn.Sequential(*(Bottleneck(c_, c_, shortcut, g, k=((1, 1), (3, 3)), e=1.0) for _ in range(n)))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            """Forward pass through the CSP bottleneck with 3 convolutions."""
            return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), 1))
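The channel bookkeeping in the C3 constructor can be traced by hand. The sketch below counts convolution weights only, under assumptions that the excerpt does not show: each `Conv` is treated as a bias-free convolution, batch-norm parameters are ignored, g = 1, and `Bottleneck` is taken to compute its hidden width as `int(c2 * e)`. The helpers `conv_weights` and `c3_conv_weights` are hypothetical names, not part of Ultralytics.

```python
def conv_weights(c_in: int, c_out: int, k: int = 1) -> int:
    """Weights of a bias-free k x k convolution with groups=1."""
    return c_in * c_out * k * k


def c3_conv_weights(c1: int, c2: int, n: int = 1, e: float = 0.5) -> dict:
    """Trace channel widths and convolution weights of a C3 block."""
    c_ = int(c2 * e)  # hidden channels, as in C3.__init__
    cv1 = conv_weights(c1, c_)      # branch that feeds the Bottlenecks
    cv2 = conv_weights(c1, c_)      # bypass branch
    cv3 = conv_weights(2 * c_, c2)  # fusion after concatenation
    # Bottleneck(c_, c_, k=((1, 1), (3, 3)), e=1.0): a 1x1 then a 3x3 conv at width c_.
    bottleneck = conv_weights(c_, c_, 1) + conv_weights(c_, c_, 3)
    return {
        "hidden": c_,
        "concat": 2 * c_,
        "weights": cv1 + cv2 + cv3 + n * bottleneck,
    }


if __name__ == "__main__":
    for n in (1, 3):
        print(n, c3_conv_weights(c1=64, c2=64, n=n))
```

With c1 = c2 = 64 the hidden width is 32 and the concatenation has 64 channels regardless of n; only the Bottleneck chain grows with n.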
Cross-Stage Partial with 2F connections (C2f)
The C2f block builds on CSPNet and extends it: instead of fusing only the final output of the Bottleneck path, it also feeds every intermediate Bottleneck output into the final fusion. The output of the first convolution is split into two halves, each with c = int(c2 · e) channels, which is half the number of output channels at the default e = 0.5. Related aggregation ideas appear in YOLOv7 [2], and the C2f block itself is used in YOLOv8 [3]. It follows the same principles as CSP, splitting the feature map to reduce computational redundancy and improve feature reuse.
In a C2f block, a single 1×1 convolution (cv1) produces 2c channels, which are split into two halves: one bypasses the Bottleneck layers as a shortcut, while the other passes through n Bottleneck layers in sequence. Unlike C3, which concatenates only the final Bottleneck output, C2f collects both halves and every intermediate Bottleneck output, (2 + n) · c channels in total, and fuses them with a second 1×1 convolution (cv2), boosting feature diversity and representation. This dual feature fusion (2F) strategy is also intended to help the network handle occlusion, making detections more robust in challenging scenes. Note that the Ultralytics docstring itself describes C2f as a faster implementation of a CSP bottleneck with 2 convolutions.
Ultralytics implementation (github link):
    class C2f(nn.Module):
        """Faster Implementation of CSP Bottleneck with 2 convolutions."""

        def __init__(self, c1: int, c2: int, n: int = 1, shortcut: bool = False, g: int = 1, e: float = 0.5):
            """
            Initialize a CSP bottleneck with 2 convolutions.

            Args:
                c1 (int): Input channels.
                c2 (int): Output channels.
                n (int): Number of Bottleneck blocks.
                shortcut (bool): Whether to use shortcut connections.
                g (int): Groups for convolutions.
                e (float): Expansion ratio.
            """
            super().__init__()
            self.c = int(c2 * e)  # hidden channels
            self.cv1 = Conv(c1, 2 * self.c, 1, 1)
            self.cv2 = Conv((2 + n) * self.c, c2, 1)  # optional act=FReLU(c2)
            self.m = nn.ModuleList(Bottleneck(self.c, self.c, shortcut, g, k=((3, 3), (3, 3)), e=1.0) for _ in range(n))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            """Forward pass through C2f layer."""
            y = list(self.cv1(x).chunk(2, 1))
            y.extend(m(y[-1]) for m in self.m)
            return self.cv2(torch.cat(y, 1))

        def forward_split(self, x: torch.Tensor) -> torch.Tensor:
            """Forward pass using split() instead of chunk()."""
            y = self.cv1(x).split((self.c, self.c), 1)
            y = [y[0], y[1]]
            y.extend(m(y[-1]) for m in self.m)
            return self.cv2(torch.cat(y, 1))
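The line `y.extend(m(y[-1]) for m in self.m)` is compact but easy to misread: because the generator is consumed lazily, each Bottleneck receives the output of the previous one, and every output is kept for the final concatenation. The pure-Python sketch below mimics that data flow with strings instead of tensors. The names `c2f_trace` and `c2f_concat_channels` are hypothetical and used only for illustration, and the sketch relies on CPython appending each item to the list as it consumes the generator.

```python
def c2f_trace(n: int) -> list:
    """Return symbolic names of the tensors that C2f concatenates."""
    y = ["half_0", "half_1"]  # stands in for list(self.cv1(x).chunk(2, 1))
    # Each "block" just wraps its input name so the call chain stays visible.
    blocks = [lambda t, i=i: f"m{i}({t})" for i in range(n)]
    # Same pattern as C2f.forward: y[-1] is re-read for every block.
    y.extend(m(y[-1]) for m in blocks)
    return y


def c2f_concat_channels(c2: int, n: int, e: float = 0.5) -> int:
    """Channels entering cv2: two halves plus one output per Bottleneck."""
    c = int(c2 * e)
    return (2 + n) * c


if __name__ == "__main__":
    print(c2f_trace(2))
    print(c2f_concat_channels(c2=64, n=2))
```

For n = 2 the trace contains four entries, the last being `m1(m0(half_1))`, which shows that the Bottlenecks are chained rather than applied side by side to the same input.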
Cross Stage Partial with kernel size 2 (C3k2) block
YOLOv11 [4] uses C3k2 blocks in both its backbone and its head to process multi-scale features at various stages, another evolution of the classic CSP bottleneck. The C3k2 block splits the feature map and processes it with multiple lightweight 3×3 convolutions, merging the results afterwards. This improves information flow while remaining more compact than a complete CSP bottleneck, reducing the number of trainable parameters.
The C3k block inherits from C3 rather than C2f: the input feeds two 1×1 convolutions, one branch runs through n Bottleneck layers in sequence, and the two branches are concatenated and fused by a final 1×1 convolution. Unlike C3, C3k makes the Bottleneck kernel size k configurable, helping the model capture detail at different object scales.
Building on this idea, C3k2 inherits the C2f layout and, when its c3k flag is set, replaces each plain Bottleneck with a C3k block containing two Bottlenecks; otherwise it keeps plain Bottlenecks. It begins with a Conv block, splits the result into two halves, runs the second half through the n inner blocks in sequence, concatenates both halves with every inner block’s output, and finishes with another Conv layer. This blends CSP’s split-merge concept with flexible kernels to balance speed, parameter efficiency, and richer multi-scale feature extraction.
            Input: [Batch, c1, H, W]
                        │
                [cv1] (1x1 Conv) → splits channels into 2c
                        │
               ┌────────┴────────┐
               │                 │
           Branch 1          Branch 2
           (Bypass)     (Bottleneck chain)
               │                 │
               │                 ├─> C3k Block #1
               │                 │
               │                 ├─> C3k Block #2
               │                 │
               │                ... (n times)
               │                 │
               └────────┬────────┘
    Concatenate [Bypass, Split, C3k outputs]
                        │
                [cv2] (1x1 Conv)
                        │
            Output: [Batch, c2, H, W]
Each C3k block runs its Bottlenecks in sequence with a configurable kernel size, providing more flexibility for feature extraction and enabling the model to adapt better to complex patterns.
           C3k Input: [Batch, c, H, W]
                        │
               ┌────────┴────────┐
               │                 │
       [cv1] (1x1 Conv)  [cv2] (1x1 Conv)
               │             (Bypass)
       Bottleneck blocks         │
      B1 → B2 → ... → Bn         │
       (sequential, k×k)         │
               │                 │
               └────────┬────────┘
                   Concatenate
                        │
                [cv3] (1x1 Conv)
                        │
          C3k Output: [Batch, c, H, W]
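As a rough illustration of what a configurable kernel size changes, the sketch below computes the receptive field and the convolution weight count of the Bottleneck chain inside a C3k block. It assumes stride-1 convolutions, two k×k convolutions per Bottleneck (as in `k=(k, k)` in the Ultralytics code), bias-free convolutions, and g = 1. Both function names are hypothetical.

```python
def chain_receptive_field(n: int, k: int) -> int:
    """Receptive field (pixels per side) of n Bottlenecks in sequence.

    Each Bottleneck holds two stride-1 k x k convolutions, and every such
    convolution widens the receptive field by k - 1 pixels.
    """
    return 1 + 2 * n * (k - 1)


def chain_conv_weights(c: int, n: int, k: int) -> int:
    """Convolution weights of n Bottlenecks of width c with two k x k convs each."""
    return n * 2 * c * c * k * k


if __name__ == "__main__":
    for k in (3, 5):
        rf = chain_receptive_field(n=2, k=k)
        w = chain_conv_weights(c=16, n=2, k=k)
        print(f"k={k}: receptive field {rf}x{rf}, {w:,} conv weights")
```

Under these assumptions, moving from k = 3 to k = 5 with two Bottlenecks widens the receptive field from 9×9 to 17×17 while multiplying the chain's weights by 25/9, which is the trade-off the kernel parameter exposes.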
Ultralytics implementation (github link):
    class C3k(C3):
        """C3k is a CSP bottleneck module with customizable kernel sizes for feature extraction in neural networks."""

        def __init__(self, c1: int, c2: int, n: int = 1, shortcut: bool = True, g: int = 1, e: float = 0.5, k: int = 3):
            """
            Initialize C3k module.

            Args:
                c1 (int): Input channels.
                c2 (int): Output channels.
                n (int): Number of Bottleneck blocks.
                shortcut (bool): Whether to use shortcut connections.
                g (int): Groups for convolutions.
                e (float): Expansion ratio.
                k (int): Kernel size.
            """
            super().__init__(c1, c2, n, shortcut, g, e)
            c_ = int(c2 * e)  # hidden channels
            # self.m = nn.Sequential(*(RepBottleneck(c_, c_, shortcut, g, k=(k, k), e=1.0) for _ in range(n)))
            self.m = nn.Sequential(*(Bottleneck(c_, c_, shortcut, g, k=(k, k), e=1.0) for _ in range(n)))


    class C3k2(C2f):
        """Faster Implementation of CSP Bottleneck with 2 convolutions."""

        def __init__(
            self, c1: int, c2: int, n: int = 1, c3k: bool = False, e: float = 0.5, g: int = 1, shortcut: bool = True
        ):
            """
            Initialize C3k2 module.

            Args:
                c1 (int): Input channels.
                c2 (int): Output channels.
                n (int): Number of blocks.
                c3k (bool): Whether to use C3k blocks.
                e (float): Expansion ratio.
                g (int): Groups for convolutions.
                shortcut (bool): Whether to use shortcut connections.
            """
            super().__init__(c1, c2, n, shortcut, g, e)
            self.m = nn.ModuleList(
                C3k(self.c, self.c, 2, shortcut, g) if c3k else Bottleneck(self.c, self.c, shortcut, g) for _ in range(n)
            )
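One way to check the parameter-efficiency claim is to count convolution weights for C2f and C3k2 with identical arguments. The sketch below does so under several assumptions that the excerpts do not show: `Conv` is bias-free, batch-norm parameters are ignored, g = 1, and `Bottleneck` falls back to `k=(3, 3)` and `e=0.5` when C3k2 calls it without those arguments. All helper names are hypothetical.

```python
def conv_w(c_in: int, c_out: int, k: int = 1) -> int:
    """Weights of a bias-free k x k convolution with groups=1."""
    return c_in * c_out * k * k


def bottleneck_w(c: int, k=(3, 3), e: float = 0.5) -> int:
    """Two convolutions: c -> int(c * e) with k[0], then back to c with k[1]."""
    hidden = int(c * e)
    return conv_w(c, hidden, k[0]) + conv_w(hidden, c, k[1])


def c3k_w(c: int, n: int = 2, e: float = 0.5, k: int = 3) -> int:
    """C3-style block: cv1 and cv2 (c -> c_), cv3 (2c_ -> c), n Bottlenecks at e=1.0."""
    c_ = int(c * e)
    return 2 * conv_w(c, c_) + conv_w(2 * c_, c) + n * bottleneck_w(c_, (k, k), 1.0)


def outer_w(c1: int, c2: int, n: int, e: float = 0.5):
    """Hidden width plus the weights of cv1 and cv2, shared by C2f and C3k2."""
    c = int(c2 * e)
    return c, conv_w(c1, 2 * c) + conv_w((2 + n) * c, c2)


def c2f_w(c1: int, c2: int, n: int = 1) -> int:
    c, outer = outer_w(c1, c2, n)
    return outer + n * bottleneck_w(c, (3, 3), 1.0)  # Bottleneck with e=1.0


def c3k2_w(c1: int, c2: int, n: int = 1, c3k: bool = False) -> int:
    c, outer = outer_w(c1, c2, n)
    inner = c3k_w(c) if c3k else bottleneck_w(c)  # assumed Bottleneck default e=0.5
    return outer + n * inner


if __name__ == "__main__":
    print("C2f:             ", c2f_w(64, 64))
    print("C3k2 (c3k=False):", c3k2_w(64, 64, c3k=False))
    print("C3k2 (c3k=True): ", c3k2_w(64, 64, c3k=True))
```

Under these assumptions, with c1 = c2 = 64 and n = 1, both C3k2 variants come out smaller than C2f, mainly because the inner Bottlenecks run at a narrower hidden width.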
Conclusion
In short, modern YOLO architectures keep evolving by adding blocks like C3, C2f, C3k, and C3k2 — all built around the core idea of Cross-Stage Partial (CSP) connections. This CSP approach reduces computation and boosts feature representation at the same time.
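Block | Outer Structure | Inner Structure | Kernel flexibility |
---|---|---|---|
C3 | Two 1×1 branches; only the final Bottleneck output is concatenated with the bypass | Sequential Bottlenecks | Fixed kernels |
C2f | One 1×1 convolution split into two halves; every Bottleneck output is concatenated | Sequential Bottlenecks | Fixed kernels |
C3k | Same as C3 | Sequential Bottlenecks | Custom kernels |
C3k2 | Same as C2f | C3k blocks (two sequential Bottlenecks each) or plain Bottlenecks | Custom kernels inside C3k |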
These architectural refinements collectively help YOLO models maintain high detection accuracy while remaining fast and lightweight enough for real-time deployment, a critical advantage in real-world applications.
Links
- [1] https://arxiv.org/pdf/1911.11929
- [2] https://arxiv.org/pdf/2207.02696
- [3] https://arxiv.org/pdf/2408.15857
- [4] https://arxiv.org/html/2410.17725v1#S3