Exploring the Data
Each annihilation event produces a number of charged particles that leave a trail of hits in the ALPHA-g detector. We have prepared a dataset with 100,000 events:
/fast_scratch_1/TRISEP_data/AdvancedTutorial/small_dataset.parquet
Each event in the dataset contains:
- A true vertex position (x, y, z), i.e. the origin of the annihilation.
- A set of 3D hit positions [(x1, y1, z1), ..., (xn, yn, zn)].
You can get a quick overview of the dataset by launching a Python interpreter and running the following code:
>>> import polars as pl
>>> df = pl.read_parquet("/fast_scratch_1/TRISEP_data/AdvancedTutorial/small_dataset.parquet")
>>> print(df)
"""
shape: (100_000, 2)
┌─────────────────────────────────┬─────────────────────────────────┐
│ target ┆ point_cloud │
│ --- ┆ --- │
│ array[f32, 3] ┆ list[array[f32, 3]] │
╞═════════════════════════════════╪═════════════════════════════════╡
│ [1.455413, 15.901725, -571.578… ┆ [[46.253021, 175.486893, -558.… │
│ [22.550814, 3.005712, 834.1053… ┆ [[171.600311, -59.034386, 1070… │
│ [4.511479, -8.75235, -1014.025… ┆ [[33.221375, -178.431686, -106… │
│ [-9.729183, 4.537313, -970.073… ┆ [[33.232174, -178.489685, -110… │
│ [16.184988, -9.351818, -66.521… ┆ [[152.121017, -98.965576, -218… │
│ … ┆ … │
│ [5.014961, -12.403949, 62.5900… ┆ [[-158.989197, -87.506714, 226… │
│ [-7.504006, -18.486027, 689.96… ┆ [[-181.138474, 11.128488, -30.… │
│ [-19.474358, 13.368332, -59.90… ┆ [[-120.215218, -135.953278, -1… │
│ [-8.580782, 1.076302, -866.250… ┆ [[15.571182, -180.818787, -846… │
│ [-18.750277, 6.330079, 969.173… ┆ [[176.59346, -41.938202, 798.0… │
└─────────────────────────────────┴─────────────────────────────────┘
"""
Before training any model, it's important to understand the structure and characteristics of the data.
Activity:
- Where do annihilation events occur? A skewed distribution in vertex z positions might cause the model to "cheat" by always guessing the most common region.
- Do all events have the same number of hits? Variable-length point clouds will require special handling in the model architecture.
To help you answer these questions, we've provided the script:
AdvancedTutorial/code/visualization.py
You can run it directly to visualize key properties of a dataset:
# Target z distribution
python visualization.py target-z /path/to/dataset.parquet
# Point cloud size distribution
python visualization.py cloud-size /path/to/dataset.parquet
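You can also check both properties straight from the interpreter. The snippet below is a minimal sketch that assumes a recent polars version (one that provides the .arr and .list expression namespaces); it summarizes the vertex z position and the number of hits per event:
import polars as pl

df = pl.read_parquet("/fast_scratch_1/TRISEP_data/AdvancedTutorial/small_dataset.parquet")

stats = df.select(
    pl.col("target").arr.get(2).alias("vertex_z"),      # z component of the true vertex
    pl.col("point_cloud").list.len().alias("n_hits"),    # number of hits per event
)
print(stats.describe())  # min/max/mean/quantiles for both columns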
Iterating Through the Dataset with PyTorch
To train a model, we need to iterate through the dataset. PyTorch provides a primitive class, torch.utils.data.Dataset, that allows us to decouple data loading from the model training/batching process.
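For orientation, a map-style Dataset only needs two methods: __len__ (how many events there are) and __getitem__ (return one event by index). The toy class below is an illustrative sketch of that interface, not the provided class:
import torch
from torch.utils.data import Dataset

class ToyDataset(Dataset):
    """Minimal map-style dataset: __len__ and __getitem__ are all PyTorch needs."""

    def __init__(self, inputs, targets):
        self.inputs = inputs      # e.g. a sequence of point clouds
        self.targets = targets    # e.g. a sequence of vertex positions

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, index):
        # Return one (input, target) pair as float tensors
        x = torch.as_tensor(self.inputs[index], dtype=torch.float32)
        y = torch.as_tensor(self.targets[index], dtype=torch.float32)
        return x, y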
We've provided a PyTorch-compatible dataset class. It wraps a .parquet file and gives you easy access to the data in a PyTorch-friendly way. Create a new Python script in the AdvancedTutorial/code/ directory:
from data.dataset import PointCloudDataset
config = {"cloud_size": 140}
dataset = PointCloudDataset(
"/fast_scratch_1/TRISEP_data/AdvancedTutorial/small_dataset.parquet", config
)
index = 0 # First event
point_cloud, target = dataset[index]
Try running the code above and plot some point clouds and their corresponding targets (annihilation vertices).
Activity:
- Inspect the PointCloudDataset class. How does it handle variable-length point clouds?
- Using the first 10 events (indices 0-9), plot the point clouds and their targets. Do they look like you expected? You can make a 3D scatter plot using matplotlib (a full loop over the first 10 events is sketched after this list):
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(point_cloud[0], point_cloud[1], point_cloud[2])
ax.scatter(0, 0, target.item(), color="red")
- Using the next 10 events (indices 10-19), plot the point clouds without their targets. Make an educated guess about the target vertex position based on the point cloud. Compare your guess with the actual target positions.
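As a starting point for the plotting activities, here is a sketch that loops over the first 10 events and saves one figure per event. It assumes, as in the snippet above, that point_cloud is indexed as (coordinate, hit) and that target holds the vertex z position:
import matplotlib.pyplot as plt

from data.dataset import PointCloudDataset

config = {"cloud_size": 140}
dataset = PointCloudDataset(
    "/fast_scratch_1/TRISEP_data/AdvancedTutorial/small_dataset.parquet", config
)

for index in range(10):  # first 10 events (indices 0-9)
    point_cloud, target = dataset[index]
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.scatter(point_cloud[0], point_cloud[1], point_cloud[2], s=2)
    ax.scatter(0, 0, target.item(), color="red")  # drop this line for events 10-19
    ax.set_title(f"Event {index}")
    fig.savefig(f"event_{index:02d}.png")  # or plt.show() in an interactive session
    plt.close(fig)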