Exploring the Data

Each annihilation event produces a number of charged particles that leave a trail of hits in the ALPHA-g detector. We have prepared a dataset with 100,000 events:

/fast_scratch_1/TRISEP_data/AdvancedTutorial/small_dataset.parquet

Each event in the dataset contains:

  • A true vertex position (x, y, z), i.e. the point where the annihilation occurred.
  • A set of 3D hit positions: [(x1, y1, z1), ..., (xn, yn, zn)].

You can get a quick overview of the dataset by launching a Python interpreter and running the following code:

>>> import polars as pl
>>> df = pl.read_parquet("/fast_scratch_1/TRISEP_data/AdvancedTutorial/small_dataset.parquet")
>>> print(df)

"""
shape: (100_000, 2)
┌─────────────────────────────────┬─────────────────────────────────┐
│ target                          ┆ point_cloud                     │
│ ---                             ┆ ---                             │
│ array[f32, 3]                   ┆ list[array[f32, 3]]             │
╞═════════════════════════════════╪═════════════════════════════════╡
│ [1.455413, 15.901725, -571.578… ┆ [[46.253021, 175.486893, -558.… │
│ [22.550814, 3.005712, 834.1053… ┆ [[171.600311, -59.034386, 1070… │
│ [4.511479, -8.75235, -1014.025… ┆ [[33.221375, -178.431686, -106… │
│ [-9.729183, 4.537313, -970.073… ┆ [[33.232174, -178.489685, -110… │
│ [16.184988, -9.351818, -66.521… ┆ [[152.121017, -98.965576, -218… │
│ …                               ┆ …                               │
│ [5.014961, -12.403949, 62.5900… ┆ [[-158.989197, -87.506714, 226… │
│ [-7.504006, -18.486027, 689.96… ┆ [[-181.138474, 11.128488, -30.… │
│ [-19.474358, 13.368332, -59.90… ┆ [[-120.215218, -135.953278, -1… │
│ [-8.580782, 1.076302, -866.250… ┆ [[15.571182, -180.818787, -846… │
│ [-18.750277, 6.330079, 969.173… ┆ [[176.59346, -41.938202, 798.0… │
└─────────────────────────────────┴─────────────────────────────────┘
"""

Before training any model, it's important to understand the structure and characteristics of the data.

Activity:

  • Where do annihilation events occur? A skewed distribution in vertex z positions might cause the model to "cheat" by always guessing the most common region.
  • Do all events have the same number of hits? Variable-length point clouds will require special handling in the model architecture.

To help you answer these questions, we've provided the script AdvancedTutorial/code/visualize.py.

You can run it directly to visualize key properties of a dataset:

# Target z distribution
python visualize.py target-z /path/to/dataset.parquet

# Point cloud size distribution
python visualize.py cloud-size /path/to/dataset.parquet
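If you would rather compute these two distributions by hand, the following is a minimal sketch rather than a copy of what the provided script does. It assumes the targets and point clouds have already been converted to NumPy arrays; the function name summarize and the synthetic stand-in data are hypothetical and only there to make the example self-contained.

```python
import numpy as np


def summarize(targets, point_clouds, n_bins=10):
    """Histogram the vertex z positions and count the hits per event.

    targets: array-like of shape (n_events, 3) holding (x, y, z) vertices.
    point_clouds: sequence of arrays, each of shape (n_hits, 3).
    """
    z = np.asarray(targets, dtype=np.float32)[:, 2]
    counts, edges = np.histogram(z, bins=n_bins)
    sizes = np.array([len(cloud) for cloud in point_clouds])
    return counts, edges, sizes


# Synthetic stand-in data (hypothetical), NOT the real dataset.
rng = np.random.default_rng(seed=0)
n_events = 1000
targets = rng.normal(loc=0.0, scale=[10.0, 10.0, 500.0], size=(n_events, 3))
point_clouds = [
    rng.normal(size=(rng.integers(20, 200), 3)) for _ in range(n_events)
]

counts, edges, sizes = summarize(targets, point_clouds)
print("z histogram counts:", counts)
print("hits per event: min", sizes.min(), "max", sizes.max())
```

A strongly peaked z histogram or a wide spread in hits per event are the two things to look out for before choosing a model.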

Iterating Through the Dataset with PyTorch

To train a model, we need to iterate through the dataset. PyTorch provides the torch.utils.data.Dataset base class, which lets us decouple data loading from batching and model training.

We've provided a PyTorch-compatible dataset class. It wraps a .parquet file and gives you access to the data in a PyTorch-friendly way. To try it, create a new Python script in the AdvancedTutorial/code/ directory with the following contents:

from data.dataset import PointCloudDataset

config = {"cloud_size": 140}
dataset = PointCloudDataset(
    "/fast_scratch_1/TRISEP_data/AdvancedTutorial/small_dataset.parquet", config
)

index = 0  # First event
point_cloud, target = dataset[index]

Try running the code above and plot some point clouds and their corresponding targets (annihilation vertices).
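The cloud_size entry in the config suggests that every event is brought to a fixed number of points, which is what allows events to be stacked into a batch. How PointCloudDataset actually does this is left for you to find out; the sketch below only illustrates one common approach (truncate long clouds, zero-pad short ones) and is not taken from the provided class. The helper to_fixed_size and its pad_value parameter are hypothetical names.

```python
import numpy as np


def to_fixed_size(cloud, cloud_size, pad_value=0.0):
    """Return an array of shape (cloud_size, 3).

    Clouds with more than cloud_size hits are truncated; shorter clouds
    are padded at the end with pad_value.
    """
    cloud = np.asarray(cloud, dtype=np.float32).reshape(-1, 3)
    fixed = np.full((cloud_size, 3), pad_value, dtype=np.float32)
    n = min(len(cloud), cloud_size)
    fixed[:n] = cloud[:n]
    return fixed


# Two toy events with different numbers of hits (hypothetical data).
short = np.ones((5, 3))
long = np.ones((200, 3))

# After resizing, both events have the same shape and can be stacked.
batch = np.stack([to_fixed_size(short, 140), to_fixed_size(long, 140)])
print(batch.shape)  # (2, 140, 3)
```

With this kind of padding the model cannot tell real hits at the origin from padding unless it is also given a mask or the original hit count, which is one reason to check how the provided class handles it.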

Activity:

  • Inspect the PointCloudDataset class. How does it handle variable-length point clouds?
  • Using the first 10 events (indices 0-9), plot the point clouds and their targets. Do they look like you expected?
    You can make a 3D scatter plot using matplotlib:
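    import matplotlib.pyplot as plt
    
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    
    # point_cloud[0], [1], [2] are taken to be the x, y, z coordinates of all
    # hits, i.e. shape (3, n_points); check this against the dataset class.
    ax.scatter(point_cloud[0], point_cloud[1], point_cloud[2])
    # target.item() needs a single-element tensor; it is drawn at x = y = 0.
    ax.scatter(0, 0, target.item(), color="red")
    
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.set_zlabel("z")
    plt.show()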
    
  • Using the next 10 events (indices 10-19), plot the point clouds without their targets. Make an educated guess about the target vertex position based on the point cloud. Compare your guess with the actual target positions.