generate data utils
A collection of functions for artificially creating a labeled dataset.
See the function documentation below for more details on the implemented methods.
Also check the tutorial generate_data.ipynb for examples!
goodnoise
full signature:
def goodnoise(nbins, fstd=None)
comments:
generate one sample of 'good' noise consisting of fourier components
input args:
- nbins: number of bins, length of noise array to be sampled
- fstd: an array of length nbins used for scaling of the amplitude of the noise
bin-by-bin.
output:
- numpy array of length nbins containing the noise
badnoise
full signature:
def badnoise(nbins, fstd=None)
comments:
generate one sample of 'bad' noise consisting of fourier components
(higher frequency and amplitude than 'good' noise)
input args and output: simlar to goodnoise
WARNING: NOT NECESSARILY REPRESENTATIVE OF ANOMALIES TO BE EXPECTED, DO NOT USE
whitenoise
full signature:
def whitenoise(nbins, fstd=None)
comments:
generate one sample of white noise (uncorrelated between bins)
input args and output: similar to goodnoise
random_lico
full signature:
def random_lico(hists)
comments:
generate one linear combination of histograms with random coefficients in (0,1) summing to 1
input args:
- numpy array of shape (nhists,nbins), the rows of which will be linearly combined
output:
- numpy array of shape (nbins), containing the new histogram
smoother
full signature:
def smoother(inarray, halfwidth=1)
comments:
smooth the rows of a 2D array using the 2*halfwidth+1 surrounding values.
mse_correlation_vector
full signature:
def mse_correlation_vector(hists, index)
comments:
calculate mse of a histogram at given index wrt all other histograms
input args:
- hists: numpy array of shape (nhists,nbins) containing the histograms
- index: the index (must be in (0,len(hists)-1)) of the histogram in question
output:
- numpy array of length nhists containing mse of the indexed histogram with respect to all other histograms
WARNING: can be slow if called many times on a large collection of histograms with many bins.
moments_correlation_vector
full signature:
def moments_correlation_vector(moments, index)
comments:
calculate moment distance of hist at index wrt all other hists
very similar to mse_correlation_vector but using histogram moments instead of full histograms for speed-up
plot_data_and_gen
full signature:
def plot_data_and_gen(datahists, genhists, fig=None, axs=None, datacolor='b', gencolor='b', datalabel='Histograms from data', genlabel='Artificially generated histograms')
comments:
plot a couple of random examples from data and generated histograms
note: both are plotted in different subplots of the same figure
input arguments:
- datahists, genhists: numpy arrays of shape (nhists,nbins)
- fig, axs: a matplotlib figure object and a list of two axes objects
(if either is None, a new figure with two subplots will be created)
plot_seed_and_gen
full signature:
def plot_seed_and_gen(seedhists, genhists, fig=None, ax=None, seedcolor='b', gencolor='g', seedlabel='Histograms from data', genlabel='Artificially generated histograms')
comments:
plot seed and generated histograms
note: both are plotted in the same subplot
input arguments:
- seedhists, genhists: numpy arrays of shape (nhists,nbins)
- fig, ax: a matplotlib figure object and an axes object
(if either is None, a new figure will be created)
plot_noise
full signature:
def plot_noise(noise, fig=None, ax=None, noiselabel='Examples of noise', noisecolor='b', histstd=None, histstdlabel='Variation')
comments:
plot histograms in noise (numpy array of shape (nhists,nbins))
input arguments:
- noise: 2D numpy array of shape (nexamples,nbins)
- fig, ax: a matplotlib figure object and an axes object
(if either is None, a new figure will be created)
- noiselabel: label for noise examples (use None to not add a legend entry for noise)
- noisecolor: color for noise examples on plot
- histstd: 1D numpy array of shape (nbins) displaying some order-of-magnitude allowed variation
(typically some measure of per-bin variation in the input histogram(s))
- histstdlabel: label for histstd (use None to not add a legend entry for histstd)
fourier_noise_on_mean
full signature:
def fourier_noise_on_mean(hists, outfilename='', nresamples=0, nonnegative=True, doplot=True)
comments:
apply fourier noise on the bin-per-bin mean histogram, with amplitude scaling based on bin-per-bin std histogram.
input args:
- hists: numpy array of shape (nhists,nbins) used for determining mean and std
- outfilename: path to csv file to write results to (default: no writing)
- nresamples: number of samples to draw (default: number of input histograms / 10)
- nonnegative: boolean whether to set all bins to minimum zero after applying noise
- doplot: boolean whether to make a plot
returns:
a tuple of the form (resulting histograms, maplotlib figure, matplotlib axes),
figure and axes are None if doplot was set to False
MOSTLY SUITABLE AS HELP FUNCTION FOR RESAMPLE_SIMILAR_FOURIER_NOISE, NOT AS GENERATOR IN ITSELF
advantages: mean histogram is almost certainly 'good' because of averaging, eliminate bad histograms
disadvantages: deviations from mean are small, does not model systematic shifts by lumi.
fourier_noise
full signature:
def fourier_noise(hists, outfilename='', nresamples=1, nonnegative=True, stdfactor=15., doplot=True)
comments:
apply fourier noise on random histograms with simple flat amplitude scaling.
input args:
- hists: numpy array of shape (nhists,nbins) used for seeding
- outfilename: path to csv file to write results to (default: no writing)
- nresamples: number of samples to draw per input histogram
- nonnegative: boolean whether to set all bins to minimum zero after applying noise
- stdfactor: factor to scale magnitude of noise (larger factor = smaller noise)
- doplot: boolean whether to make a plot
returns:
a tuple of the form (resulting histograms, maplotlib figure, matplotlib axes),
figure and axes are None if doplot was set to False
advantages: resampled histograms will have statistically same features as original input set
disadvantages: also 'bad' histograms will be resampled if included in hists
upsample_hist_set
full signature:
def upsample_hist_set(hists, ntarget=-1, fourierstdfactor=15., doplot=True)
comments:
wrapper for fourier_noise allowing for a fixed target number of histograms instead of a fixed resampling factor.
useful function for quickly generating a fixed number of resampled histograms,
without bothering too much about what exact resampling technique or detailed settings would be most appropriate.
input arguments:
- hists: input histogram set
- ntarget: targetted number of resampled histograms (default: equally many as in hists)
- fourierstdfactor: see fourier_noise
- doplot: boolean whether to make a plot
returns:
a tuple of the form (resulting histograms, maplotlib figure, matplotlib axes),
figure and axes are None if doplot was set to False
white_noise
full signature:
def white_noise(hists, stdfactor=15., doplot=True)
comments:
apply white noise to the histograms in hists.
input args:
- hists: np array (nhists,nbins) containing input histograms
- stdfactor: scaling factor of white noise amplitude (higher factor = smaller noise)
- doplot: boolean whether to make a plot
returns:
a tuple of the form (resulting histograms, maplotlib figure, matplotlib axes),
figure and axes are None if doplot was set to False
resample_bin_per_bin
full signature:
def resample_bin_per_bin(hists, outfilename='', nresamples=0, nonnegative=True, smoothinghalfwidth=2, doplot=True)
comments:
do resampling from bin-per-bin probability distributions
input args:
- hists: np array (nhists,nbins) containing the histograms to draw new samples from
- outfilename: path to csv file to write results to (default: no writing)
- nresamples: number of samples to draw (default: 1/10 of number of input histograms)
- nonnegative: boolean whether or not to put all bins to minimum zero after applying noise
- smoothinghalfwidth: halfwidth of smoothing procedure to apply on the result (default: no smoothing)
- doplot: boolean whether to make a plot
returns:
a tuple of the form (resulting histograms, maplotlib figure, matplotlib axes),
figure and axes are None if doplot was set to False
advantages: no arbitrary noise modeling
disadvantages: bins are considered independent, shape of historams not taken into account,
does not work well on small number of input histograms,
does not work well on histograms with systematic shifts
resample_similar_bin_per_bin
full signature:
def resample_similar_bin_per_bin( allhists, selhists, outfilename='', nresamples=1, nonnegative=True, keeppercentage=1., doplot=True)
comments:
resample from bin-per-bin probability distributions, but only from similar looking histograms.
input args:
- allhists: np array (nhists,nbins) containing all available histograms (to determine mean)
- selhists: np array (nhists,nbins) conataining selected histograms used as seeds (e.g. 'good' histograms)
- outfilename: path of csv file to write results to (default: no writing)
- nresamples: number of samples per input histogram in selhists
- nonnegative: boolean whether or not to put all bins to minimum zero after applying noise
- keeppercentage: percentage (between 1 and 100) of histograms in allhists to use per input histogram
- doplot: boolean whether to make a plot
returns:
a tuple of the form (resulting histograms, maplotlib figure, matplotlib axes),
figure and axes are None if doplot was set to False
advantages: no assumptions on shape of noise,
can handle systematic shifts in histograms
disadvantages: bins are treated independently from each other
resample_similar_fourier_noise
full signature:
def resample_similar_fourier_noise( allhists, selhists, outfilename='', nresamples=1, nonnegative=True, keeppercentage=1., doplot=True)
comments:
apply fourier noise on mean histogram,
where the mean is determined from a set of similar-looking histograms
input args:
- allhists: np array (nhists,nbins) containing all available histograms (to determine mean)
- selhists: np array (nhists,nbins) conataining selected histograms used as seeds (e.g. 'good' histograms)
- outfilename: path of csv file to write results to (default: no writing)
- nresamples: number of samples per input histogram in selhists
- nonnegative: boolean whether or not to put all bins to minimum zero after applying noise
- keeppercentage: percentage (between 1 and 100) of histograms in allhists to use per input histogram
- doplot: boolean whether to make a plot
returns:
a tuple of the form (resulting histograms, maplotlib figure, matplotlib axes),
figure and axes are None if doplot was set to False
advantages: most of fourier_noise_on_mean but can additionally handle shifting histograms,
apart from fourier noise, also white noise can be applied.
disadvantages: does not filter out odd histograms as long as enough other odd histograms look more or less similar
resample_similar_lico
full signature:
def resample_similar_lico( allhists, selhists, outfilename='', nresamples=1, nonnegative=True, keeppercentage=1., doplot=True)
comments:
take linear combinations of similar histograms
input arguments:
- allhists: 2D np array (nhists,nbins) with all available histograms, used to take linear combinations
- selhists: 2D np array (nhists,nbins) with selected hists used for seeding (e.g. 'good' histograms)
- outfilename: path to csv file to write result to (default: no writing)
- nresamples: number of combinations to make per input histogram
- nonnegative: boolean whether to make all final histograms nonnegative
- keeppercentage: percentage (between 0. and 100.) of histograms in allhists to use per input histogram
- doplot: boolean whether to make a plot
returns:
a tuple of the form (resulting histograms, maplotlib figure, matplotlib axes),
figure and axes are None if doplot was set to False
advantages: no assumptions on noise
disadvantages: sensitive to outlying histograms (more than with averaging)
mc_sampling
full signature:
def mc_sampling(hists, nMC=10000 , nresamples=10, doplot=True)
comments:
resampling of a histogram using MC methods
Drawing random points from a space defined by the range of the histogram in all axes.
Points are "accepted" if the fall under the sampled histogram:
f(x) - sampled distribution
x_r, y_r -> randomly sampled point
if y_r<=f(x_r), fill the new distribution at bin corresponding to x_r with weight:
weight = (sum of input hist)/(#mc points accepted)
this is equal to
weight = (MC space volume)/(all MC points)