wsipipe.datasets package

Datasets contain information on sets of data, e.g. file locations, number of slides, labels, etc. A dataset is a dataframe with the columns slide, annotation, label and tags.

slide contains the path to the whole slide image (WSI)

annotation contains the path to the annotation file, or the slide label

label contains the slide level label

tags contains any other information about the slide (multiple pieces of data are separated by semicolons).
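
As an illustration of the layout described above, the following minimal sketch builds a dataset dataframe by hand. The paths, labels, and tag values are made up, and split_tags is a hypothetical helper that is not part of wsipipe; it only shows how a semicolon-separated tags string might be split.

```python
import pandas as pd

# Hypothetical rows: the paths, labels, and tag values are invented
# for illustration and do not come from any real dataset.
dataset = pd.DataFrame(
    {
        "slide": ["slides/example_001.tif", "slides/example_002.tif"],
        "annotation": ["annotations/example_001.xml", ""],
        "label": ["tumor", "normal"],
        "tags": ["site_a;patient_17", ""],
    }
)


def split_tags(tags):
    """Split a semicolon-separated tags string, dropping empty pieces.

    Hypothetical helper, not part of wsipipe.
    """
    return [piece.strip() for piece in tags.split(";") if piece.strip()]


# One list of tag pieces per slide; an empty tags string gives an empty list.
tag_lists = dataset["tags"].map(split_tags).tolist()
```

Keeping tags as a single string column keeps the dataframe at four fixed columns regardless of how much extra information a particular dataset carries.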

camelyon16 module

This module creates the dataframe for the Camelyon 16 dataset with the following columns:

The slide column stores the paths on disk of the whole slide images.
The annotation column records a path to the annotation files.
The label column is the slide level label.
The tags column is blank for Camelyon 16.

This assumes there is a folder on disk with the same structure as the Camelyon 16 Google Drive download from the Camelyon grand challenge: https://camelyon17.grand-challenge.org/Data/

testing(cam16_path=PosixPath('data/camelyon16'), project_root=None)

Create the Camelyon 16 testing dataset

This function goes through the input directories for the testing slides and matches each slide with its annotation. It creates a dataframe holding the slide path, the matching annotation path, and the slide label. The tags column is present but empty, as it is not used for this dataset.

Parameters:

cam16_path (Path, optional) – a path relative to the project root that is the location of the Camelyon 16 data. Defaults to data/camelyon16.
project_root (Optional[Path]) –

Returns:

A dataframe with columns slide, annotation, label and tags

Return type:

df (pd.DataFrame)

training(cam16_path=PosixPath('data/camelyon16'), project_root=None)

Create the Camelyon 16 training dataset

This function goes through the input directories for the training slides and matches each slide with its annotation. It creates a dataframe holding the slide path, the matching annotation path, and the slide label. The tags column is present but empty, as it is not used for this dataset.

Parameters:

cam16_path (Path, optional) – a path relative to the project root that is the location of the Camelyon 16 data. Defaults to data/camelyon16.
project_root (Optional[Path]) –

Returns:

A dataframe with columns slide, annotation, label and tags

Return type:

df (pd.DataFrame)
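
The matching of slides to annotations described for testing and training can be sketched as follows. This is a minimal sketch and not the module's implementation: the function name match_slides_to_annotations, the folder names, the .tif and .xml extensions, and the rule that a slide without an annotation file is labelled "normal" are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def match_slides_to_annotations(slide_dir, annotation_dir):
    """Pair each slide with the annotation file that shares its stem.

    Assumptions (not taken from wsipipe): slides are .tif files,
    annotations are .xml files, and a slide with no annotation file
    is labelled "normal".
    """
    annotations = {p.stem: p for p in Path(annotation_dir).glob("*.xml")}
    rows = []
    for slide in sorted(Path(slide_dir).glob("*.tif")):
        annotation = annotations.get(slide.stem)
        rows.append(
            {
                "slide": str(slide),
                "annotation": str(annotation) if annotation else "",
                "label": "tumor" if annotation else "normal",
                "tags": "",  # unused for this dataset
            }
        )
    return rows


# Build a tiny throwaway folder so the sketch runs without the real data.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "images").mkdir()
    (root / "annotations").mkdir()
    for name in ("slide_a.tif", "slide_b.tif"):
        (root / "images" / name).touch()
    (root / "annotations" / "slide_a.xml").touch()
    rows = match_slides_to_annotations(root / "images", root / "annotations")
```

The resulting list of dictionaries has the four expected keys, so it could be passed to pd.DataFrame(rows) to obtain a dataframe in the dataset layout.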

stripai module

This module creates the dataframe for the STRIP AI dataset with the following columns:

The slide column stores the paths on disk of the whole slide images.
The annotation column records a string with the slide label.
The label column is the slide level label.
The tags column contains the center and patient for each slide.

This assumes there is a folder on disk with the same structure as the download from the Kaggle website: https://www.kaggle.com/competitions/mayo-clinic-strip-ai/data

convert_to_pyramids(data_root=PosixPath('data/mayo-clinic-strip-ai'), out_root=PosixPath('experiments/mayo_pyramids'), project_root=None)

Create pyramids for whole slide images

The whole slide images as downloaded only contain data at level 0; no other levels are present. This can make the slides slow to access. This function runs over all the slides in the dataset and writes out copies that contain a pyramid of levels. Files are written to the out_root folder, which defaults to experiments/mayo_pyramids.

Parameters:

data_root (Path, optional) – a path relative to the project root that is the location of the Strip AI data. Defaults to data/mayo-clinic-strip-ai.
out_root (Path, optional) – the folder the pyramid copies are written to. Defaults to experiments/mayo_pyramids.
project_root (Optional[Path]) –
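
To give a feel for what a pyramid of levels adds on top of level 0, the sketch below lists the image size at each level. It is illustrative only: the function pyramid_level_sizes, the halving between levels, and the 256 pixel stopping size are assumptions, and the actual level layout depends on the tool used to write the files.

```python
def pyramid_level_sizes(width, height, min_size=256):
    """List (width, height) for each pyramid level, starting at level 0.

    Assumption: each level halves both dimensions of the previous one,
    and levels are added until both sides are at most min_size pixels.
    """
    sizes = [(width, height)]
    while width > min_size or height > min_size:
        width = max(1, width // 2)
        height = max(1, height // 2)
        sizes.append((width, height))
    return sizes


# A made-up 4096 x 2048 slide; real whole slide images are far larger.
levels = pyramid_level_sizes(4096, 2048)
```

Under these assumptions a viewer or patch sampler that only needs an overview can read one of the small upper levels instead of decoding the full level 0 image, which is why slides without a pyramid can be slow to access.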

training(data_root=PosixPath('data/mayo-clinic-strip-ai'), project_root=None)

Create Strip AI training dataset

This function goes through the input directories for the training slides and matches the slide paths with the information in the csv. It creates a dataframe holding the slide path and the matching slide label, which is stored in both the label and annotation columns. The tags column stores the patient id and center id.

Parameters:

data_root (Path, optional) – a path relative to the project root that is the location of the Strip AI data. Defaults to data/mayo-clinic-strip-ai.
project_root (Optional[Path]) –

Returns:

A dataframe with columns slide, annotation, label and tags

Return type:

df (pd.DataFrame)
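
The step of matching slide paths with the information in the csv can be sketched as below. This is a hedged sketch, not the module's code: the csv column names (image_id, center_id, patient_id, label), the train subfolder, the .tif extension, the order of the two tag pieces, and the example rows are all assumptions made for illustration.

```python
import csv
import io
from pathlib import Path

# Assumed csv layout with invented rows; the column names are an
# assumption and are not taken from the stripai module.
EXAMPLE_CSV = """image_id,center_id,patient_id,label
aaa_0,4,aaa,CE
bbb_0,7,bbb,LAA
"""


def rows_from_csv(csv_text, slide_dir):
    """Build dataset rows from csv text (hypothetical helper)."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        rows.append(
            {
                # Assumed naming: one .tif file per image_id.
                "slide": str(Path(slide_dir) / f"{record['image_id']}.tif"),
                # The slide label is stored in both columns.
                "annotation": record["label"],
                "label": record["label"],
                # Center and patient joined with a semicolon.
                "tags": f"{record['center_id']};{record['patient_id']}",
            }
        )
    return rows


strip_rows = rows_from_csv(EXAMPLE_CSV, "data/mayo-clinic-strip-ai/train")
```

Storing the label in the annotation column as well keeps the four-column layout identical to datasets that have real annotation files.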

dataset_utils module

sample_dataset(df, samples_per_class)

Create a subset of a dataset dataframe. This function creates a smaller dataframe that only includes n slides per class. This can be used to create smaller datasets, for example for debugging pipelines.

Parameters:

df (pd.DataFrame) – A dataframe containing a column called label
samples_per_class (int) – The number of slides per class to return

Returns:

A copy of the dataframe with samples_per_class rows for each label

Return type:

df (pd.DataFrame)
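
One way to get this behaviour with pandas is sketched below. It is an illustrative equivalent rather than the library's implementation: sample_per_class, the seed parameter, and the example rows are assumptions, while groupby(...).sample(n=...) is the standard pandas call used to draw a fixed number of rows per group.

```python
import pandas as pd


def sample_per_class(df, samples_per_class, seed=0):
    """Return samples_per_class randomly chosen rows for each label.

    Hypothetical stand-in for sample_dataset; the seed argument is an
    addition that makes the sketch reproducible.
    """
    return df.groupby("label").sample(n=samples_per_class, random_state=seed)


# Invented dataset with three slides per class.
dataset = pd.DataFrame(
    {
        "slide": [f"slides/example_{i}.tif" for i in range(6)],
        "annotation": [""] * 6,
        "label": ["normal", "normal", "normal", "tumor", "tumor", "tumor"],
        "tags": [""] * 6,
    }
)

subset = sample_per_class(dataset, samples_per_class=2)
```

In this sketch, asking for more rows than a class contains makes pandas raise an error, so samples_per_class should not exceed the size of the smallest class.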