`SoDDA`

Soft Data Driven Allocation Classification

hea-www.harvard.edu/AstroStat/SoDDA

| Description | ReadMe | Download |

Last Updated: 2019aug02

Description

SoDDA is a soft-clustering scheme for classifying galaxies using the four emission-line ratios [NII]/Ha, [SII]/Ha, [OI]/Ha, and [OIII]/Hb. It fits several multivariate Gaussians to the 4-D distribution of the observed data to capture local structure; these Gaussians are then grouped to represent the complex structure of the joint distribution of galaxies in the 4-D line-ratio space.
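As a rough sketch of the idea (the symbols K, w_k, mu_k, Sigma_k and G_c are notation assumed here for illustration, not taken from the paper), a Gaussian mixture density and the resulting soft class membership can be written as:

```latex
p(x) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad
P(c \mid x) = \frac{\sum_{k \in G_c} w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}
                   {\sum_{j=1}^{K} w_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}
```

Here x is the vector of four log line ratios and G_c is the set of Gaussian components grouped into activity class c, so the class probabilities sum to one over the classes.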

The method is described in detail in Stampoulis, V., van Dyk, D.A., Kashyap, V.L., & Zezas, A., 2019, Multidimensional Data Driven Classification of Emission-line Galaxies, MNRAS, accepted: Manuscript [.pdf]; arXiv:1802.02133 [url]

ReadMe

A guide to the contents:

The classification.py file contains the following sample functions:

1) compute_prob(x, weights, means, covars, clusters = clusters) :
	x : 1x4 numpy array containing the ratios
	  LOG(NII/H_ALPHA), LOG(SII/H_ALPHA), LOG(OI/H_ALPHA),
	  LOG(OIII/H_BETA) respectively
	weights: the weights of the 20 subpopulations,
	  contained in the supplementary material
	means: the means of the 20 subpopulations, contained
	  in the supplementary material
	covars: the covariances of the 20 subpopulations,
	  contained in the supplementary material
	clusters:  the allocation of the 20 subpopulations

	RETURNS: the posterior probabilities of a galaxy
	  belonging to each activity class, SFG, Seyferts,
	  LINERs and Composites.

Example: 

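import scipy.stats as stats
import numpy as np
import pandas as pd
# Assumes classification.py is in the working directory or on the Python path
from classification import compute_prob

# Mixture parameters from the supplementary material
means = np.load("means.npy")
weights = np.load("weights.npy")
covars = np.load("covars.npy")

# One galaxy: identifier, index, then the four log line ratios
data = pd.DataFrame(np.array([299491051364706304, 6, -0.525441, -0.556073, -1.623533, -0.621178]).reshape(1,6), columns = ["SPECOBJID", "Index", "LOG(NII/H_ALPHA)", "LOG(SII/H_ALPHA)", "LOG(OI/H_ALPHA)", "LOG(OIII/H_BETA)"])
# Columns 2 onward hold the four ratios in the order compute_prob expects
posterior_prob = compute_prob(data.iloc[0, 2:], weights, means, covars)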
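
For readers who want to see the kind of calculation that compute_prob describes, here is a minimal sketch. It is not the code in classification.py: the helper names (gaussian_logpdf, class_posteriors), the integer form of the allocation array, and the toy 2-D parameters are all assumptions made for illustration. It computes each Gaussian component's posterior weight for a point and sums those weights within each class.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log-density of a multivariate normal evaluated at the point x."""
    d = x.shape[0]
    diff = x - mean
    # Solve a linear system rather than inverting the covariance explicitly.
    maha = diff @ np.linalg.solve(cov, diff)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def class_posteriors(x, weights, means, covars, allocation, n_classes):
    """Sum the component posterior weights within each class (hypothetical helper)."""
    log_terms = np.array([
        np.log(w) + gaussian_logpdf(x, m, c)
        for w, m, c in zip(weights, means, covars)
    ])
    # Normalise in log space for numerical stability.
    log_terms -= log_terms.max()
    resp = np.exp(log_terms)
    resp /= resp.sum()
    # allocation[k] is the integer class index assigned to component k.
    return np.bincount(allocation, weights=resp, minlength=n_classes)

# Toy 2-D mixture with made-up numbers (NOT the published SoDDA parameters).
weights = np.array([0.5, 0.3, 0.2])
means = np.array([[0.0, 0.0], [0.5, 0.5], [4.0, 4.0]])
covars = np.array([np.eye(2)] * 3)
allocation = np.array([0, 0, 1])  # components 0 and 1 form class 0

probs = class_posteriors(np.array([0.1, 0.0]), weights, means, covars, allocation, 2)
print(probs)
```

With the real supplementary files, the same pattern would use 20 components in four dimensions and four classes; how the actual `clusters` argument encodes the allocation is not specified here.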
	

2) svm_classification_4d(x, svc_4d_coef, svc_4d_inter):
	x : 1x4 numpy array containing the ratios
	  LOG(NII/H_ALPHA), LOG(SII/H_ALPHA), LOG(OI/H_ALPHA),
	  LOG(OIII/H_BETA) respectively
	svc_4d_coef: the 4-dimensional SVM coefficients,
	  contained in the supplementary material
	svc_4d_inter: the 4-dimensional SVM intercepts,
	  contained in the supplementary material
	
	RETURNS: the classification of a galaxy according to
	  the 4-dimensional SVM (0 for SFG, 1 for Seyferts, 2
	  for LINERs, 3 for Composites, 5 for undefined).
	  Note on the undefined case: we trained the SVM with
	  the scikit-learn library using the one-vs-rest
	  (`ovr') decision function. This kind of decision
	  function can leave regions of the 4-dimensional
	  space that are not covered by any of the 4 classes.
	  scikit-learn's approach for such points is to
	  classify them based on the distance to the
	  boundaries. The trained SVM model, included in the
	  online version as svm_4d.sav, can be used for this
	  purpose.


Example:


import pickle
import numpy as np
import pandas as pd
from sklearn import svm  # scikit-learn must be installed to unpickle svm_4d.sav
# Assumes classification.py is in the working directory or on the Python path
from classification import svm_classification_4d

# One galaxy: identifier, index, then the four log line ratios
data = pd.DataFrame(np.array([299491051364706304, 6, -0.525441, -0.556073, -1.623533, -0.621178]).reshape(1,6), columns = ["SPECOBJID", "Index", "LOG(NII/H_ALPHA)", "LOG(SII/H_ALPHA)", "LOG(OI/H_ALPHA)", "LOG(OIII/H_BETA)"])
svc_4d_inter = np.load("svm_4d_intercept.npy")
svc_4d_coef = np.load("svm_4d_coefs.npy")
predicted_class = svm_classification_4d(data.iloc[0, 2:], svc_4d_coef, svc_4d_inter)
print("class based on 4d SVM = " + str(predicted_class))

# Use the trained classifier
filename = 'svm_4d.sav'
with open(filename, 'rb') as f:
    svc = pickle.load(f)
predicted_class = svc.predict(data.iloc[0, 2:].values.reshape(1,4))
print("class based on 4d trained SVM classifier = " + str(predicted_class))
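
The undefined case can be illustrated with a small sketch. It is not the code in classification.py and makes several assumptions: linear decision functions given by a coefficient matrix and an intercept vector, an "undefined" result whenever anything other than exactly one class score is positive, and made-up toy coefficients in two dimensions. The function names are hypothetical.

```python
import numpy as np

UNDEFINED = 5  # label the ReadMe uses for the undefined case

def ovr_linear_classify(x, coef, intercept):
    """Return the only class with a positive one-vs-rest score, else UNDEFINED."""
    scores = coef @ x + intercept  # one score per class
    claimed = np.flatnonzero(scores > 0)
    if claimed.size == 1:
        return int(claimed[0])
    return UNDEFINED

def ovr_fallback(x, coef, intercept):
    """Fallback for undefined points: pick the class with the largest score."""
    return int(np.argmax(coef @ x + intercept))

# Toy 2-D problem with three classes (made-up numbers, not the SoDDA SVM).
coef = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])
intercept = np.array([-1.0, -1.0, -1.0])

inside = ovr_linear_classify(np.array([2.0, 0.0]), coef, intercept)
gap = ovr_linear_classify(np.array([0.2, 0.5]), coef, intercept)
gap_fallback = ovr_fallback(np.array([0.2, 0.5]), coef, intercept)
print(inside, gap, gap_fallback)
```

The second point lies in a region that no class claims, so the strict rule reports 5, while the fallback assigns it to the class whose score is highest.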


3) svm_classification_3d(x, svc_3d_coef, svc_3d_inter):
	x : 1x3 numpy array containing the ratios
	  LOG(NII/H_ALPHA), LOG(SII/H_ALPHA), LOG(OIII/H_BETA)
	  respectively
	svc_3d_coef: the 3-dimensional SVM coefficients,
	  contained in the supplementary material
	svc_3d_inter: the 3-dimensional SVM intercepts,
	  contained in the supplementary material
	
	RETURNS: the classification of a galaxy according to
	  the 3-dimensional SVM (0 for SFG, 1 for Seyferts, 2
	  for LINERs, 3 for Composites, 5 for undefined; see
	  svm_classification_4d for an explanation of undefined).
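
The 3-dimensional input uses the same ratios as the 4-dimensional one, without LOG(OI/H_ALPHA). A minimal sketch of building it from the example galaxy above, assuming the column order listed in this ReadMe (the variable names are illustrative):

```python
import numpy as np

# Order assumed from the ReadMe: NII/Ha, SII/Ha, OI/Ha, OIII/Hb (all log ratios).
x_4d = np.array([-0.525441, -0.556073, -1.623533, -0.621178])

# Drop the LOG(OI/H_ALPHA) entry (index 2) to obtain the 3-D input vector.
x_3d = np.delete(x_4d, 2)
print(x_3d)
```

The resulting vector would then be passed as x to svm_classification_3d together with the 3-dimensional coefficients and intercepts from the supplementary material.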

Download

The entire set of programs, example datasets, and associated subroutines may be downloaded as the tar file hea-www.harvard.edu/SoDDA/SoDDA.tar.gz


[CHASC]