clustering utils

A collection of functions used for performing clustering tasks

This collection of tools is a little deprecated at this moment but kept for reference; it contains functionality for pre-filtering the histograms in the training set based on their moments (e.g. mean, rms).
Note that the functions here have not been used in a long time and might need some maintenance before they work properly again.


vecdist

full signature:

def vecdist(moments, index)  

comments:

calculate the vectorial distance between a set of moments  
input arguments:  
- moments: 2D numpy array of shape (ninstances,nmoments)  
- index: index for which instance to calculate the distance relative to the other instances  
returns:  
- a distance measure for the given index w.r.t. the other instances in 'moments'  
notes:  
- for this distance measure, the points are considered as vectors and the point at index is the origin.  
  with respect to this origin, the average vector before index and the average vector after index are calculated.  
  the distance is then defined as the norm of the difference of these vectors,   
  normalized by the norms of the individual vectors.  

costhetadist

full signature:

def costhetadist(moments, index)  

comments:

calculate the costheta distance between a set of moments  
input arguments:  
- moments: 2D numpy array of shape (ninstances,nmoments)  
- index: index for which instance to calculate the distance relative to the other instances  
returns:  
- a distance measure for the given index w.r.t. the other instances in 'moments'  
notes:  
- this distance measure takes the cosine of the angle between the point at index  
  and the one at index-1 (interpreted as vectors from the origin).  

avgnndist

full signature:

def avgnndist(moments, index, nn)  

comments:

calculate average euclidean distance to neighbouring points  
input arguments:  
- moments: 2D numpy array of shape (ninstances,nmoments)  
- index: index for which instance to calculate the distance relative to the other instances  
- nn: (half-) window size  
returns:  
- a distance measure for the given index w.r.t. the other instances in 'moments'  
notes:  
- for this distance measure, the average euclidean distance is calculated between the point at 'index'  
  and the points at index-nn and index+nn (e.g. the nn previous and next lumisections).  

getavgnndist

full signature:

def getavgnndist(hists, nmoments, xmin, xmax, nbins, nneighbours)  

comments:

apply avgnndist to a set of histograms  

filteranomalous

full signature:

def filteranomalous(df, nmoments=3, rmouterflow=True, rmlargest=0., doplot=True)  

comments:

do a pre-filtering, removing the histograms with anomalous moments