wsipipe.preprocess.sample package

sampler module

Samplers apply different sampling policies to patchsets.

balanced_sample(patches, num_samples, floor_samples=1000, sampling_policy=simple_random)

Creates a balanced sample containing the same number of patches from each class.

Counts the patches in each class and takes the size of the smallest class as the per-class target. If the smallest class holds more patches than the requested number per class, the requested number is used instead; otherwise the target stays at the size of the smallest class.

If one class is much smaller than all the others, floor_samples sets the minimum number of patches returned for every class that has more patches than the floor. For example, if one class had only 50 patches and the others all had more than the floor of 1000, every class would return 1000 patches except the small class, which would return 50. Without the floor, all classes would be limited to 50 patches.

A sampling policy then selects that number of patches per class from the overall patchset, for example random, random with replacement, or weighted random.

Parameters
  • patches (PatchSet) – The patchset to sample from

  • num_samples (int) – The requested number of patches per class

  • floor_samples (int, optional) – The minimum number of samples for large classes. Defaults to 1000.

  • sampling_policy (Callable, optional) – The function used to select patches within each class. Defaults to simple_random.

Returns

A patchset containing a balanced sample of patches

Return type

PatchSet
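The count selection described for balanced_sample can be sketched in plain Python. This is one reading of the description, not the library's code: per_class_sample_counts and the class labels are hypothetical names. The description does not say what happens when num_samples is smaller than floor_samples, so the sketch simply lets the floor take precedence in that case.

```python
def per_class_sample_counts(class_totals, num_samples, floor_samples=1000):
    """Return how many patches to draw from each class (illustrative sketch)."""
    # Start from the size of the smallest class.
    smallest = min(class_totals.values())
    # Cap the target at the requested number of patches per class.
    target = min(smallest, num_samples)
    # Raise the target to the floor so one tiny class does not limit the rest.
    target = max(target, floor_samples)
    # A class can never contribute more patches than it actually has.
    return {label: min(total, target) for label, total in class_totals.items()}


# Hypothetical class totals: one class with 50 patches, the others above the floor.
counts = per_class_sample_counts(
    {"tumor": 50, "normal": 4000, "stroma": 2500}, num_samples=2000
)
print(counts)
```

With these totals the small class contributes its 50 patches and the two larger classes contribute 1000 each, matching the worked example in the description.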

simple_random(class_df, sum_totals)

Takes a random sample without replacement from a dataframe of a single class

Parameters
  • class_df (pandas.DataFrame) –

  • sum_totals (int) –

Return type

pandas.DataFrame

simple_random_replacement(class_df, sum_totals)

Takes a random sample with replacement from a dataframe of a single class

Parameters
  • class_df (pandas.DataFrame) –

  • sum_totals (int) –

Return type

pandas.DataFrame
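The difference between sampling without and with replacement can be illustrated with pandas.DataFrame.sample. This is a minimal sketch, not the library's implementation: the helper names, the seed argument, and the column names are hypothetical, and it assumes the second argument is the number of rows to draw, which the parameter lists do not confirm.

```python
import pandas as pd

# Hypothetical single-class patch table; column names are illustrative only.
class_df = pd.DataFrame(
    {"x": [0, 256, 512, 768, 1024], "y": [0] * 5, "label": [1] * 5}
)


def sample_without_replacement(df, n, seed=None):
    # Each row appears at most once; pandas raises ValueError if n > len(df).
    return df.sample(n=n, replace=False, random_state=seed)


def sample_with_replacement(df, n, seed=None):
    # Rows may be drawn repeatedly, so n can exceed len(df).
    return df.sample(n=n, replace=True, random_state=seed)


unique_rows = sample_without_replacement(class_df, 3, seed=0)
repeated_rows = sample_with_replacement(class_df, 8, seed=0)
print(len(unique_rows), len(repeated_rows))
```

Drawing 8 rows from a 5-row table is only possible with replacement, which is why a with-replacement policy suits classes that hold fewer patches than the requested sample size.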

slide_weighted_random(class_df, sum_totals)

Takes a sample weighted per slide. Weights are inverse to the number of samples per slide, so the sample should contain approximately the same number of patches per slide, even if some slides have many more patches than others. Samples with replacement.

Parameters
  • class_df (pandas.DataFrame) –

  • sum_totals (int) –

Return type

pandas.DataFrame
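Inverse per-slide weighting can be sketched with the weights argument of pandas.DataFrame.sample. This is an illustration under stated assumptions rather than the library's code: slide_weighted_sketch, the seed argument, and the slide_idx column name are hypothetical, and the second argument is assumed to be the number of rows to draw.

```python
import pandas as pd


def slide_weighted_sketch(class_df, n, slide_col="slide_idx", seed=None):
    # Number of patches on the slide that each row belongs to.
    patches_on_slide = class_df[slide_col].map(class_df[slide_col].value_counts())
    # Inverse weighting gives every slide the same total weight;
    # pandas normalizes the weights so that they sum to 1.
    weights = 1.0 / patches_on_slide
    return class_df.sample(n=n, replace=True, weights=weights, random_state=seed)


# Hypothetical class table: slide 0 has 900 patches, slide 1 has only 100.
class_df = pd.DataFrame({"slide_idx": [0] * 900 + [1] * 100, "label": [1] * 1000})
sampled = slide_weighted_sketch(class_df, 2000, seed=0)
per_slide = sampled["slide_idx"].value_counts()
print(per_slide)
```

Although slide 0 holds nine times as many patches as slide 1, each slide carries half of the total weight, so the two slides contribute roughly equal numbers of rows to the sample.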