wsipipe.preprocess.sample package
sampler module
Samplers apply different sampling policies to patchsets.
- balanced_sample(patches, num_samples, floor_samples=1000, sampling_policy=simple_random)
Creates a balanced sample containing the same number of patches from each class.
Counts the patches in each class and uses the size of the smallest class as the number of patches per class. If the smallest class has more patches than the requested number per class, the requested number is returned for each class; otherwise each class is limited to the size of the smallest class.
If one class is much smaller than the others, floor_samples sets the minimum number of patches returned for every class that has more patches than that floor. For example, if one class had only 50 patches and all the others had more than the floor of 1000, every class would return 1000 patches except the small class, which would return 50. Without the floor, all classes would be limited to 50 patches.
A sampling policy is then applied to select that number of patches for each class from the overall patchset, for example random, random with replacement, or weighted random.
- Parameters
patches (PatchSet) – A PatchSet
num_samples (int) – The requested number of patches per class
floor_samples (int, optional) – The minimum number of samples for large classes. Defaults to 1000
sampling_policy (Callable, optional) – The function used to select patches from each class. Defaults to simple_random
- Returns
A patchset containing a balanced sample of patches
- Return type
PatchSet
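The per-class counts described for balanced_sample can be sketched as follows. This is one reading of the description, not the library's code: the helper name per_class_counts and the example class names are hypothetical, and how a num_samples value below floor_samples is handled is an assumption the description does not settle.

```python
import pandas as pd


def per_class_counts(labels: pd.Series, num_samples: int, floor_samples: int = 1000) -> dict:
    """Hypothetical helper: how many patches to draw from each class."""
    totals = labels.value_counts()  # patches available per class
    # Balance on the smallest class, capped at the requested number.
    target = min(int(totals.min()), num_samples)
    # Assumption: classes larger than the floor still get at least the floor.
    target = max(target, floor_samples)
    # A class can never contribute more patches than it has.
    return {label: min(int(n), target) for label, n in totals.items()}


# One tiny class (50 patches) next to two large ones.
labels = pd.Series(["tumor"] * 50 + ["normal"] * 3000 + ["stroma"] * 5000)
counts = per_class_counts(labels, num_samples=2000)
print(counts)
```

With these counts the small class contributes its 50 patches and each large class contributes 1000, which matches the worked example in the description.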
- simple_random(class_df, sum_totals)
Takes a random sample without replacement from a dataframe of a single class
- Parameters
class_df (pandas.DataFrame) –
sum_totals (int) –
- Return type
pandas.DataFrame
- simple_random_replacement(class_df, sum_totals)
Takes a random sample with replacement from a dataframe of a single class
- Parameters
class_df (pandas.DataFrame) –
sum_totals (int) –
- Return type
pandas.DataFrame
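The difference between sampling without and with replacement can be illustrated with pandas DataFrame.sample. This is a minimal sketch, assuming the second argument is the number of rows to draw; the function names and the example columns are hypothetical stand-ins, not the library's implementation.

```python
import pandas as pd


def sample_without_replacement(class_df: pd.DataFrame, n: int) -> pd.DataFrame:
    # Each row can be picked at most once, so n must not exceed len(class_df).
    return class_df.sample(n=n, replace=False)


def sample_with_replacement(class_df: pd.DataFrame, n: int) -> pd.DataFrame:
    # Rows may be picked repeatedly, so n can exceed len(class_df).
    return class_df.sample(n=n, replace=True)


# Hypothetical single-class dataframe with five patches.
class_df = pd.DataFrame({"x": range(5), "y": range(5), "label": [1] * 5})

unique_rows = sample_without_replacement(class_df, 5)  # every row exactly once
repeated_rows = sample_with_replacement(class_df, 8)   # must contain duplicates
```

Asking for more rows than the class contains raises a ValueError without replacement, so a policy that samples with replacement is the option when a class has to be oversampled.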
- slide_weighted_random(class_df, sum_totals)
Takes a sample weighted per slide. Each patch is weighted by the inverse of the number of patches on its slide, so the sample should contain approximately the same number of patches from each slide, even if some slides have many more patches than others. Samples with replacement.
- Parameters
class_df (pandas.DataFrame) –
sum_totals (int) –
- Return type
pandas.DataFrame
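Inverse-frequency weighting per slide can be sketched with pandas as shown here. The column name slide_idx, the function name, and the seed parameter are assumptions made for illustration; the library's own column names and implementation may differ.

```python
import pandas as pd


def slide_weighted_sketch(class_df: pd.DataFrame, n: int,
                          slide_col: str = "slide_idx", seed=None) -> pd.DataFrame:
    """Hypothetical sketch: weight each patch by 1 / (patches on its slide)."""
    # Number of patches on the slide each row belongs to.
    per_slide = class_df.groupby(slide_col)[slide_col].transform("count")
    weights = 1.0 / per_slide
    # pandas normalises the weights to sum to 1 before drawing.
    return class_df.sample(n=n, replace=True, weights=weights, random_state=seed)


# Slide 0 has nine times as many patches as slide 1.
class_df = pd.DataFrame({"slide_idx": [0] * 900 + [1] * 100, "label": 1})

sample = slide_weighted_sketch(class_df, n=2000, seed=0)
share = sample["slide_idx"].value_counts(normalize=True)
print(share)
```

Because every slide's weights add up to the same total, each slide is expected to supply roughly half of the sample here, despite the 9:1 imbalance in available patches.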