Understanding the Model Architecture
Before improving our model, let's understand its structure.
The model you're using is a simplified PointNet. Surprisingly, we can understand everything by just studying the very simple (~20 actual lines of code) `_TNet` class.
This architecture consists of three main components:
- **Feature Extraction:** The input tensor has shape `(B, C, L)`, where `B` is the batch size, `C` is the number of channels (input features), and `L` is the number of points. The feature extractor applies a series of 1D convolution layers to represent the input points in a higher-dimensional space of abstract features.
Let's see a minimal example:
```python
import torch
import torch.nn as nn

x = torch.randn(2, 3, 4)  # (B, C, L): 2 point clouds, 3 channels, 4 points each
print("Input shape:", x.shape)

# A pointwise 1D convolution mapping 3 input channels to 5 output channels
net = nn.Conv1d(3, 5, 1)
x = net(x)
print("Output shape:", x.shape)
```
Note that the feature extractor also includes batch normalization and ReLU activation layers, which are essential for training deep neural networks.
**Activity:** Change the `net` in the example above to `nn.BatchNorm1d(3)` or `nn.ReLU()`, and manually compute what these layers would do to the input tensor (print `x` before and after the operation to verify your calculations).

Together, these layers make up the entire feature extractor. Just basic operations, chained one after another. Once you break it down, there's no mystery!
- **Global Feature Aggregation:** Once each point has been mapped to a higher-dimensional feature vector, we need to summarize the entire collection of points into a single, fixed-size representation. This is done using a simple operation known as max pooling: it takes the maximum value across all points for each feature, resulting in a single vector that captures the most significant features of the entire point cloud.

**Activity:** Change the `net(x)` in the example above to an `x.max(dim=2).values` transformation and check the output values and shape. Note that this operation is inherently order-invariant, making it suitable for point clouds, where the order of points doesn't matter.
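A quick sketch of that pooling step; the `(2, 5, 4)` feature shape is just the output shape of the convolution example above:

```python
import torch

features = torch.randn(2, 5, 4)  # (B, F, L): per-point features from the extractor

# Take the maximum over the points dimension: one value per feature, per sample.
global_features = features.max(dim=2).values
print(global_features.shape)  # torch.Size([2, 5])

# Shuffling the points gives the same result -- max pooling is order-invariant.
perm = torch.randperm(4)
print(torch.equal(features[:, :, perm].max(dim=2).values, global_features))  # True
```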
- **Fully Connected Regressor:** After pooling, we are left with a single `(B, F)` tensor: one global feature vector per sample in the batch. The final step is to map this to our final prediction. This is done by the regressor block, a series of fully connected linear layers, each followed by batch normalization and ReLU activation.

**Activity:** Try passing a `(2, 1024)` tensor through an `nn.Linear(1024, 1)` layer and then a `ReLU()` to see how this maps feature vectors toward output predictions.
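A minimal sketch of that last activity; the 1024-dimensional feature size is just the value from the exercise, not necessarily what the real model uses:

```python
import torch
import torch.nn as nn

global_features = torch.randn(2, 1024)  # (B, F): one global feature vector per sample

# One linear layer maps each 1024-dim vector to a single value; ReLU clamps negatives.
head = nn.Sequential(nn.Linear(1024, 1), nn.ReLU())
print(head(global_features).shape)  # torch.Size([2, 1])
```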
You now understand the full architecture of `_TNet`. This structure is compact but powerful, and it is nearly the complete architecture of our model.
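To tie the three components together, here is a hypothetical, stripped-down module in the spirit of `_TNet`; the layer widths are illustrative and not necessarily those used in the actual class:

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Illustrative only: feature extraction -> global max pooling -> regressor."""

    def __init__(self, in_channels=3, out_dim=1):
        super().__init__()
        # 1. Feature extraction: pointwise 1D convolutions with BN and ReLU.
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
        )
        # 3. Regressor: fully connected layers with BN and ReLU, then the output layer.
        self.regressor = nn.Sequential(
            nn.Linear(128, 64), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Linear(64, out_dim),
        )

    def forward(self, x):               # x: (B, C, L)
        x = self.features(x)            # (B, 128, L)
        x = x.max(dim=2).values         # 2. Global feature aggregation: (B, 128)
        return self.regressor(x)        # (B, out_dim)

print(TinyPointNet()(torch.randn(2, 3, 16)).shape)  # torch.Size([2, 1])
```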
**Activity:** Open the full model and compare the `_TNet` class to the full `Regressor` class. What is the difference between them? Try sketching the full model using a diagram or by describing it in your own words.