SoDDA is a soft-clustering scheme for classifying galaxies using the four emission-line ratios [NII]/Ha, [SII]/Ha, [OI]/Ha, and [OIII]/Hb. It fits several multivariate Gaussians to the 4-D distribution of the observed data to capture local structures; these Gaussians are then grouped to represent the complex multi-dimensional structure of the joint distribution of galaxies in the 4-D line-ratio space.
A guide to the contents:
The classification.py file contains the following sample functions:
1) compute_prob(x, weights, means, covars, clusters = clusters) :
x : 1x4 numpy array containing the ratios
LOG(NII/H_ALPHA), LOG(SII/H_ALPHA), LOG(OI/H_ALPHA),
LOG(OIII/H_BETA) respectively
weights: the weights of the 20 subpopulations,
contained in the supplementary material
means: the means of the 20 subpopulations, contained
in the supplementary material
covars: the covariances of the 20 subpopulations,
contained in the supplementary material
clusters: the allocation of the 20 subpopulations
to the activity classes
RETURNS: the posterior probabilities of a galaxy
belonging to each activity class, SFG, Seyferts,
LINERs and Composites.
Example:
import scipy.stats as stats
import numpy as np
import pandas as pd
from classification import compute_prob  # assumes classification.py is on the import path
means = np.load("means.npy")
weights = np.load("weights.npy")
covars = np.load("covars.npy")
data = pd.DataFrame(np.array([299491051364706304, 6, -0.525441, -0.556073, -1.623533, -0.621178]).reshape(1,6), columns = ["SPECOBJID", "Index", "LOG(NII/H_ALPHA)", "LOG(SII/H_ALPHA)", "LOG(OI/H_ALPHA)", "LOG(OIII/H_BETA)"])
posterior_prob = compute_prob(data.iloc[0, 2:], weights, means, covars)
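How per-class posteriors can follow from the mixture parameters is sketched below. This is a minimal illustration, not the code in classification.py: it assumes the posterior of a class is the summed weighted Gaussian density of the subpopulations allocated to it, divided by the total mixture density. The names `gaussian_logpdf`, `sketch_class_posteriors`, and `allocation`, and the toy two-dimensional parameters, are hypothetical; the real model uses the 20 four-dimensional subpopulations from the supplementary material.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log-density of a multivariate normal at point x."""
    d = x.shape[0]
    diff = x - mean
    # Solve cov @ z = diff rather than inverting cov explicitly
    maha = diff @ np.linalg.solve(cov, diff)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def sketch_class_posteriors(x, weights, means, covars, allocation, n_classes):
    """Hypothetical helper: posterior probability of each class at point x."""
    x = np.asarray(x, dtype=float)
    # Joint log-density log(w_k * N(x | mu_k, Sigma_k)) for each subpopulation
    log_joint = np.array([
        np.log(w) + gaussian_logpdf(x, m, c)
        for w, m, c in zip(weights, means, covars)
    ])
    # Normalise in a numerically stable way to get subpopulation responsibilities
    resp = np.exp(log_joint - log_joint.max())
    resp /= resp.sum()
    # Sum the responsibilities of the subpopulations allocated to each class
    return np.bincount(allocation, weights=resp, minlength=n_classes)

# Toy parameters (made up): 3 subpopulations in 2-D, grouped into 2 classes
weights = np.array([0.5, 0.3, 0.2])
means = np.array([[0.0, 0.0], [0.5, 0.5], [3.0, 3.0]])
covars = np.array([np.eye(2), np.eye(2), np.eye(2)])
allocation = np.array([0, 0, 1])  # subpopulations 0 and 1 form class 0

posterior = sketch_class_posteriors([0.1, 0.2], weights, means, covars, allocation, 2)
print(posterior)
```

Because the output is a set of probabilities rather than a single label, a galaxy lying between two classes keeps a share of both, which is what makes the scheme a soft clustering.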
2) svm_classification_4d(x, svc_4d_coef, svc_4d_inter):
x : 1x4 numpy array containing the ratios
LOG(NII/H_ALPHA), LOG(SII/H_ALPHA), LOG(OI/H_ALPHA),
LOG(OIII/H_BETA) respectively
svc_4d_coef: the 4-dimensional SVM coefficients,
contained in the supplementary material
svc_4d_inter: The 4-dimensional SVM intercepts,
contained in the supplementary material
RETURNS: the classification of a galaxy according to
the 4-dimensional SVM (0 for SFG, 1 for Seyferts, 2 for
LINERs, 3 for Composites, 5 for undefined).
Note on the undefined case: we trained the SVM with the
scikit-learn library, using the one-vs-rest (`ovr')
decision function. This kind of decision function can
leave regions of the 4-dimensional space that are not
covered by any of the 4 classes. scikit-learn's approach
for such points is to classify them based on their
distance to the decision boundaries. The trained SVM
model, included in the online version as svm_4d.sav,
can be used for this purpose.
Example:
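import numpy as np
import pandas as pd
from sklearn import svm
import pickle
from classification import svm_classification_4d  # assumes classification.py is on the import path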
data = pd.DataFrame(np.array([299491051364706304, 6, -0.525441, -0.556073, -1.623533, -0.621178]).reshape(1,6), columns = ["SPECOBJID", "Index", "LOG(NII/H_ALPHA)", "LOG(SII/H_ALPHA)", "LOG(OI/H_ALPHA)", "LOG(OIII/H_BETA)"])
svc_4d_inter = np.load("svm_4d_intercept.npy")
svc_4d_coef = np.load("svm_4d_coefs.npy")
predicted_class = svm_classification_4d(data.iloc[0, 2:], svc_4d_coef, svc_4d_inter)
print("class based on 4d SVM = " + str(predicted_class))
# Use the trained classifier
filename = 'svm_4d.sav'
with open(filename, 'rb') as f:
    svc = pickle.load(f)
predicted_class = svc.predict(data.iloc[0, 2:].values.reshape(1,4))
print("class based on 4d trained SVM classifier = " + str(predicted_class))
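The undefined label can be illustrated with a sketch of a linear one-vs-rest decision rule. It assumes that each class has one linear classifier with score `coef @ x + intercept`, that a class claims a point when its score is positive, and that a point claimed by no class receives label 5. The function name `sketch_ovr_label`, the tie-break by highest score, and the two-dimensional toy coefficients are hypothetical; this is not the body of svm_classification_4d.

```python
import numpy as np

UNDEFINED = 5  # label used above for points that no class covers

def sketch_ovr_label(x, coef, intercept):
    """Hypothetical one-vs-rest rule with an explicit undefined label."""
    x = np.asarray(x, dtype=float)
    # One score per class: positive means "this class", negative means "the rest"
    scores = coef @ x + intercept
    if not np.any(scores > 0):
        return UNDEFINED
    # Assumption: when several classes claim the point, take the largest score
    return int(np.argmax(scores))

# Toy 2-D problem (made up): three classes, each claiming one half-plane
coef = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, -1.0]])
intercept = np.array([-1.0, -1.0, -1.0])

print(sketch_ovr_label([2.0, 0.0], coef, intercept))  # only class 0 has a positive score
print(sketch_ovr_label([0.0, 0.5], coef, intercept))  # no positive score, so undefined
```

The pickled classifier behaves differently at such points: its `predict` method always returns one of the trained classes, chosen by distance to the boundaries as described above.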
3) svm_classification_3d(x, svc_3d_coef, svc_3d_inter):
x : 1x3 numpy array containing the ratios
LOG(NII/H_ALPHA), LOG(SII/H_ALPHA), LOG(OIII/H_BETA)
respectively
svc_3d_coef: the 3-dimensional SVM coefficients,
contained in the supplementary material
svc_3d_inter: The 3-dimensional SVM intercepts,
contained in the supplementary material
RETURNS: the classification of a galaxy according to
the 3-dimensional SVM (0 for SFG, 1 for Seyferts, 2 for
LINERs, 3 for Composites, 5 for undefined; see
svm_classification_4d for an explanation of the undefined case).
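No example is given for the 3-dimensional function, so the sketch below only shows one way to build its input. Because the 3-D classifier omits LOG(OI/H_ALPHA), the positional slice `data.iloc[0, 2:]` used in the 4-D examples would pass four values; selecting the three columns by name avoids that. The variable names `ratios_3d` and `x_3d` are illustrative, and the call to svm_classification_3d is left as a comment because the names of the 3-D coefficient and intercept files are not stated here.

```python
import numpy as np
import pandas as pd

columns = ["SPECOBJID", "Index", "LOG(NII/H_ALPHA)", "LOG(SII/H_ALPHA)",
           "LOG(OI/H_ALPHA)", "LOG(OIII/H_BETA)"]
data = pd.DataFrame(
    np.array([299491051364706304, 6, -0.525441, -0.556073,
              -1.623533, -0.621178]).reshape(1, 6),
    columns=columns,
)

# Pick the three ratios by name, in the order the 3-D classifier expects
ratios_3d = ["LOG(NII/H_ALPHA)", "LOG(SII/H_ALPHA)", "LOG(OIII/H_BETA)"]
x_3d = data.loc[0, ratios_3d].to_numpy(dtype=float)

print(x_3d.shape)
# With the 3-D coefficients and intercepts loaded, the call would be:
# predicted_class = svm_classification_3d(x_3d, svc_3d_coef, svc_3d_inter)
```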
The entire set of programs, example datasets, and associated subroutines may be downloaded as the tar file hea-www.harvard.edu/SoDDA/SoDDA.tar.gz