The AstroStat Slog » High-Energy

[Book] The Elements of Statistical Learning, 2nd Ed.

hlee — Thu, 22 Jul 2010 13:25:44 +0000

This was written more than a year ago, and I forgot to post it.

I’ve noticed that interest in data mining and machine learning is growing rapidly among astronomers, but the level of execution is still rudimentary or partial because there has been no comprehensive tutorial-style literature or book for them. I recently introduced a machine learning book written by an engineer. Although it’s a very good book, it didn’t convey the foundations of machine learning built by statisticians. In the search for another good book that would satisfy astronomers’ pursuit of (machine) learning methodology with a proper amount of statistical theory, the first great book that came along was The Elements of Statistical Learning. I chose it for this post not only because of its fame and its famous authors (Hastie, Tibshirani, and Friedman) but also because of my personal story. In addition, the 2nd edition, which contains the most up-to-date and state-of-the-art information, was released recently.

First, the book website:

The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman

You’ll find examples, R code, relevant publications, and the plots used in the textbook.

Second, I want to tell how I learned about this book before its first edition was published. Everyone has a small moment of meeting very famous people. Mine is shaking hands with President Clinton in 2000. I still remember the moment vividly because I really wanted to tell him that ice cream was dripping on his nice suit, but his top-of-the-line guards blocked my attempt to speak or to point at the dripping ice cream after the handshake. Whatever the context, shaking hands with one of the greatest presidents is a memorable thing. Yet it was not my most cherished moment, because of the dripping ice cream and the scary bodyguards. My most cherished moment of meeting famous people is the half-hour conversation with the late Prof. Leo Breiman (click for my two postings about him), author of a probability textbook, creator of CART, and a foremost pioneer in machine learning.

The conclusion of that conversation, after I explained to him my ideas for applying statistics to astronomical data and he gave his advice on each problem, was a book soon to be published. I was not yet able to understand all of the statistics, so his pointer to this forthcoming book was the most relevant and apt answer at that time.

This conversation happened during the 3rd Statistical Challenges in Modern Astronomy (SCMA) conference. Not long had passed since I began my graduate study in statistics, but I had an opportunity to assist the conference organizer, my advisor Dr. Babu, and to do some chores during the conference. By accident, I had read the book by Murtagh about multivariate data analysis, so I wanted to speak to him. Apart from that, I had no desire to speak to renowned speakers and attendees. Frankly, I didn’t have any idea who was who at the conference; a few years later, I realized that the conference had drawn many famous people and that the density of such people was higher than at any other conference I have attended. Who would have imagined, at that time, that I could have a personal conversation with Prof. Breiman? I have seen often enough that famous professors are trailed by a train of people during conferences. Getting a chance to chat for even a few seconds is really hard, and tall, strong people always push someone small like me away.

The story goes like this: on a perfect, sunny early summer afternoon, he was taking a break for a cigar and I had finished my errands for the session. With not much to do until the end of the session, I decided to get some fresh air, and I spotted him enjoying his cigar. The worst part was that I didn’t know he was the person behind CART and a founder of statistical machine learning. From his talk in the previous session, I had learned only that he was a statistician who did data mining on galaxies. So I asked him if I could join him and ask some questions related to some ideas that I had. One topic I wanted to talk about was the classification of SN light curves. According to the astronomy textbooks of that time, there are Type I and Type II, and Type I has subcategories Ia, Ib, and Ic. Later, I heard that there is a Type III. The challenge is that the observations are not made at equal intervals. There were more data mining topics, and the conversation went on for a while. In the end, he recommended a book that would be published soon.

With such a story, the privilege of talking to the late Prof. Breiman at a very unique meeting, SCMA, before knowing the fame of the book, this book became one of my favorites. The book did indeed become popular; around that time it was almost the only book discussing statistical learning, and therefore it was an excellent textbook for introducing statistics to engineers and machine learning to statisticians. In the meantime, statistical learning has enjoyed popularity in many disciplines that have data sets and an urge to learn from them with the aid of machines. Books and journals on machine learning, data mining, and knowledge discovery (KDD) now flourish. I was delighted to see the 2nd edition on the market to bridge the gap that opened over the years.

I thank him for sharing his cigar time, probably his short but precious free time for contemplation, with me. I thank him for his patience in spending time with such an ignorant girl with a foreign English accent. And I thank him for introducing a book that would become a bible in the statistical learning community within a couple of years (I felt proud that I had access to the book before people knew about it). Perhaps astronomers cannot get from this book the many joys that I experienced, which came from how I encountered the book, who introduced it, whether it was used in a course, how often it is cited, etc. But I assure you that it will narrow the gap between how astronomers think about data mining (preprocessing, pipelining, and building catalogs) and how statisticians treat data mining. The newly released 2nd edition should help narrow the gap further and assist astronomers in devising brilliant learning algorithms specific to astronomical data. [The END]

—————————– Here, I patch my scribbles about the book.

What distinguishes this book from other machine learning books is not only that the authors are big figures in statistics but also that the fundamentals of statistics and probability are discussed in all chapters. Most machine learning books introduce elementary statistics and probability only in chapter 2, and no statistical basics are discussed in later chapters. Generally, they present empirical procedures, computer algorithms, and their results without presenting the underlying statistical theory.

You might want to check the book’s website for data sets if you want to try some of the ideas described there:
The Elements of Statistical Learning
In addition to its historical footprint in the field of statistical learning, I’m sure that some astronomers will want to check out topics in the book. It will help to replace some data analysis methods in astronomy, which will celebrate their centennials sooner or later, with state-of-the-art methods that can cope with modern data.

This new edition reflects some of the evolution in statistical learning, whereas the first edition was an excellent harbinger of the field. Page numbers below refer to the 2nd edition.

[p.28] Suppose in fact that our data arose from a statistical model $Y=f(X)+e$ where the random error $e$ has $E(e)=0$ and is independent of $X$. Note that for this model, $f(x)=E(Y|X=x)$ and in fact the conditional distribution $\Pr(Y|X)$ depends on $X$ only through the conditional mean $f(x)$.
The additive error model is a useful approximation to the truth. For most systems the input-output pairs $(X,Y)$ will not have a deterministic relationship $Y=f(X)$. Generally there will be other unmeasured variables that also contribute to $Y$, including measurement error. The additive model assumes that we can capture all these departures from a deterministic relationship via the error $e$.
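
The statement that f(x)=E(Y|X=x) under the additive error model can be checked numerically. The following is a minimal sketch, not taken from the book: the function f, the noise level sigma, the evaluation point x0, and the sample size are all hypothetical choices made only for illustration.

```python
import math
import random


def f(x):
    """Hypothetical 'true' function; any deterministic f would do."""
    return math.sin(x)


def draw_y_given_x(x, n, sigma, rng):
    """Draw n values of Y = f(x) + e, with e ~ N(0, sigma^2) independent of x."""
    return [f(x) + rng.gauss(0.0, sigma) for _ in range(n)]


rng = random.Random(0)          # fixed seed so the run is repeatable
x0 = 1.0                        # assumed evaluation point
ys = draw_y_given_x(x0, n=20000, sigma=0.5, rng=rng)
cond_mean = sum(ys) / len(ys)   # Monte Carlo estimate of E(Y | X = x0)

print(f"f(x0) = {f(x0):.3f}")
print(f"mean of Y given X = x0: {cond_mean:.3f}")
```

With a zero-mean error that does not depend on X, the two printed numbers should agree up to Monte Carlo noise. The sketch says nothing about real data, where whether the error is additive and independent of X is exactly what has to be validated.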

How statisticians envision “model” and “measurement errors” is quite different from how astronomers envision them, although in terms of the “additive error model” the two views match, thanks to the properties of the Gaussian (normal) distribution. Still, a chicken-or-egg dilemma exists prior to any statistical analysis.

[p.30] Although somewhat less glamorous than the learning paradigm, treating supervised learning as a problem in function approximation encourages the geometrical concepts of Euclidean spaces and mathematical concepts of probabilistic inference to be applied to the problem. This is the approach taken in this book.

I strongly recommend reading chapter 3, Linear Methods for Regression. In astronomy, there are many important coefficients that come from regression models, from the Hubble constant to absorption corrections (temperature and magnitude conversion is another example). It seems that these relations are explained only via OLS (ordinary least squares) with the homogeneous error assumption. Yet books on regression and linear models are generally not thin. The more diversity exists in data sets, the more methodology, theory, and assumptions exist in order to reflect that diversity. One might like to study the statistical properties of these indicators based on mixture and hierarchical modeling. Some inferences, say on a population proportion, can be drawn to verify some hypotheses in cosmology in an indirect way. Understanding what regression analysis and its assumptions are, and how statisticians’ efforts have made these methods more robust, more interpretable, and closer to reality, would change the habit of forcing E(Y|X)=aX+b models onto data that show correlation (not causality).
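
To make the role of the homogeneous error assumption concrete, here is a small sketch comparing an ordinary least squares line fit with a weighted one. It is only an illustration under stated assumptions: the helper name fit_line, the five data points, and the error bars are all made up, and weighted least squares is just one of the many alternatives that the chapter and the wider regression literature cover.

```python
def fit_line(x, y, sigma=None):
    """Least-squares fit of y = a*x + b.

    With sigma=None every point gets equal weight (OLS, homogeneous errors).
    Otherwise point i gets weight 1/sigma[i]**2 (weighted least squares).
    """
    w = [1.0] * len(x) if sigma is None else [1.0 / s ** 2 for s in sigma]
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    a = sxy / sxx
    b = ybar - a * xbar
    return a, b


# Hypothetical data: the underlying relation is y = 2x + 1, but the last
# point is badly off and is known to carry a much larger error bar.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 19.0]
sigma = [1.0, 1.0, 1.0, 1.0, 100.0]

a_ols, b_ols = fit_line(x, y)          # pretends all errors are equal
a_wls, b_wls = fit_line(x, y, sigma)   # uses the stated error bars

print(f"OLS: a = {a_ols:.3f}, b = {b_ols:.3f}")
print(f"WLS: a = {a_wls:.3f}, b = {b_wls:.3f}")
```

In this made-up case the equal-weight fit is pulled far from the underlying slope of 2 by the single noisy point, while the weighted fit stays close to it. Which weighting is right for a given astronomical relation is a modeling question that the code cannot answer.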

mini-Workshop on Computational AstroStatistics [announcement]

chasc — Mon, 21 Jun 2010 16:25:31 +0000

mini-Workshop on Computational Astro-statistics: Challenges and Methods for Massive Astronomical Data
Aug 24-25, 2010
Phillips Auditorium, CfA,
60 Garden St., Cambridge, MA 02138

URL: http://hea-www.harvard.edu/AstroStat/CAS2010

The California-Boston-Smithsonian Astrostatistics Collaboration plans to host a mini-workshop on Computational Astro-statistics. With the advent of new missions like the Solar Dynamics Observatory (SDO), the Panoramic Survey Telescope and Rapid Response System (Pan-STARRS), and the Large Synoptic Survey Telescope (LSST), astronomical data collection is fast outpacing our capacity to analyze the data. Astrostatistical effort has generally focused on principled analysis of individual observations, on one or a few sources at a time. But the new era of data intensive observational astronomy forces us to consider combining multiple datasets and inferring parameters that are common to entire populations. Many astronomers really want to use every data point and even non-detections, but this becomes problematic for many statistical techniques.

The goal of the Workshop is to explore new problems in Astronomical data analysis that arise from data complexity. Our focus is on problems that have generally been considered intractable due to insufficient computational power or inefficient algorithms, but are now becoming tractable. Examples of such problems include: accounting for uncertainties in instrument calibration; classification, regression, and density estimations of massive data sets that may be truncated and contaminated with measurement errors and outliers; and designing statistical emulators to efficiently approximate the output from complex astrophysical computer models and simulations, thus making statistical inference on them tractable. We aim to present some issues to the statisticians and clarify difficulties with the currently used methodologies, e.g. MCMC methods. The Workshop will consist of review talks on current Statistical methods by Statisticians, descriptions of data analysis issues by astronomers, and open discussions between Astronomers and Statisticians. We hope to define a path for development of new algorithms that target specific issues, designed to help with applications to SDO, Pan-STARRS, LSST, and other survey data.

We hope you will be able to attend the workshop and present a brief talk on the scope of the data analysis problem that you confront in your project. The workshop will have presentations in the morning sessions, followed by a discussion session in the afternoons of both days.

SDO launched

vlk — Thu, 11 Feb 2010 19:04:00 +0000

The Solar Dynamics Observatory, which promises a flood of data on the Sun, was launched today from Cape Canaveral.

data analysis system and its documentation

hlee — Fri, 02 Oct 2009 02:11:04 +0000

So far, I haven’t complained much about my “statistician learning astronomy” experience. Instead, I’ve been trying to emphasize how fascinating it is. I hope that more statisticians can join this adventure now that statisticians’ insights are in demand more than ever. However, this positivity does not seem to have worked so far. In the two years of this slog’s life, there has been no posting by a statistician, except one about BEHR. Statisticians are busy and well distracted by other fields with more tangible data sets. Or, compared to other fields, astronomy has too many obstacles and too high barriers for statisticians to participate. I’d like to talk about these challenges from my end.^[1]

The biggest challenge for a statistician in using astronomical data is the lack of mercy for nonspecialists, both in accessing data (format, quantification, and qualification^[2]) and in using data analysis systems. IDL is costly, although it is used in many disciplines, and other tools in astronomy are hardly used outside the projects they were built for.^[3] In that regard, I welcome astronomers’ use of Python to break such exclusiveness in astronomical data analysis systems.

Even if the data and software issues are resolved, there’s another barrier to climb: validation. If you have a catalog, you’ll see measured variables and their errors, typically reflecting the size of the PSF and its convolution with those metrics. When a model with a Gaussian assumption has been applied in order to tabulate a power-law index, a King, Petrosian, or de Vaucouleurs profile index, and numerous other metrics, I often fail to find any validation of the Gaussian assumptions, Gaussian residuals, spectral and profile models, outliers, and optimal binning. Even if a data set is publicly available, I also fail to find out how to read in the raw data, what factors must be considered, and what can be discarded because of unexpected contamination such as cosmic rays and charge overflows. How would I validate that raw data read into a data analysis system are processed correctly to match the values in catalogs? How would I know that all entries in a catalog are ready for further scientific data analysis? Are those sources real? Is the p-value appropriately computed?

I posted an article about Chernoff faces applied to Capella observations from Chandra. Astronomers had already processed the raw data and published a catalog of X-ray spectra. Therefore, I believed that the information in the catalog was validated and ready to be used for scientific data analysis. I heard that Capella is observed repeatedly for calibration. Generally speaking, in other fields, calibration targets are almost time invariant and exhibit consistency. If Capella has been the same star over the 10 years, the faces in my post should look almost the same, within measurement error; but as you saw, they were not consistent at all. Those faces look as if the observations were made of different objects. So far I have failed to find any validation effort explaining why certain ObsIDs of Capella look different from the rest. Are they really Capella? Or can I use this inconsistent facial expression as evidence that the Chandra calibration at that time was inappropriate? Or can I conclude that Capella was a wrong choice for calibration?

Because there is no description of the quantification procedure from the raw data to the catalog, I decided to access the raw data and process them on my own to cross-check the validity of the catalog entries. The benefit of this effort is that I can easily manipulate the data for further statistical inference. Although reading and processing raw data may sound easy, I came across another problem: the lack of documentation that would let nonspecialists perform the task.

A while ago, I talked about read.table() in R. There are slightly different commands and options, but without much of a hurdle one can easily read in ASCII data in various styles with read.table() for exploratory and confirmatory data analysis with R. From my understanding, statisticians do not spend much time on reading in data or collecting them. We are interested in methodology for extracting information about the population from a sample. While the focus is methodology, all the frustrations with astronomical data analysis software occur before investigating the best method. The level of frustration reached the extent of ending my eagerness for further investigation of inference tools.

In order to assess those Capella observations, I invoked ciao, thanks to its on-site help. Beforehand, I’d like to state that I use ciao as an example only to illustrate the cultural difference that I experienced as a statistician. I use it to discuss why I think that astronomical data analysis systems are short of documentation and why astronomical data processing procedures lack validation. I must say that I confronted very similar problems when I tried to learn astronomical packages such as IRAF and AIPS. Ciao happened to be at hand when I wrote this post.

In order to understand X-ray data, one needs not only image data files but also the effective area (arf), redistribution matrix (rmf), and point spread function (psf). These files are called calibration data files. If the package had been developed for general users, I would expect a homogenized, centralized function with options, like read.table(), for reading data including calibration data. Instead, there were various functions one could use to read in data, but the descriptions were not enough to know which one does what. What is the functionality of these commands? Which one only stores the names of data files? Which one reconfigures the raw data to reflect the up-to-date calibration file? Not knowing the complete data structures and classes within ciao, and not getting the exact functionality of these data reading functions from ahelp, I was not sure whether the log likelihood that I computed was appropriate.

For example, there are five different ways to associate an arf in ciao: read_arf(), load_arf(), set_arf(), get_arf(), and unpack_arf(). Except for unpack_arf(), I couldn’t understand the difference among these functions for accessing an arf.^[4] Other software that I use, including XSPEC, in general has a single function with options to execute different levels of reading in data. Ciao has extensive web documentation but no tutorial (see my post). So I read all the ahelp “commands” a few times. But I still couldn’t decide which one to use for my work to read in arfs and rmfs (I happened to have many calibration data files).

	arf	rmf	psf	pha	data
get	get_arf	get_rmf	get_psf	get_pha	get_data
set	set_arf	set_rmf	set_psf	set_pha	set_data
unpack	unpack_arf	unpack_rmf	unpack_psf	unpack_pha	unpack_data
load	load_arf	load_rmf	load_psf	load_pha	load_data
read	read_arf	read_rmf	read_psf	read_pha	read_data

[Note that the links above may not work since the ciao documentation website evolves quickly. Some might be routed to different links, so please check this website for other data reading commands: cxc.harvard.edu/sherpa/ahelp/index_alphabet.html].

So, several months back, I decided to seek help from the cxc help desk. Their answers are very reliable and prompt. My question was “what are the differences among read_xxx(), load_xxx(), set_xxx(), get_xxx(), and unpack_xxx(), where xxx can be data, arf, rmf, and psf?” The answer to this question was:

You can find detailed explanations for these Sherpa commands in the “ahelp” pages of the Sherpa website:

http://cxc.harvard.edu/sherpa/ahelp/index_alphabet.html

This is a good answer but a big culture shock to a statistician. It’s like IDL users asking about the difference between read.table() and scan() and getting an answer like “check http://www.r-project.org/search.html and http://cran.r-project.org/doc/FAQ/R-FAQ.html”. Probably, for astronomers, all of the data reading commands above are self-explanatory, just as read.table(), read.csv(), and scan() are in R. Disappointingly, this answer was not what I was looking for.

Well, owing to this bewilderment, hesitation, and some skepticism, I couldn’t move on to the next step of implementing fitting methods. At the beginning, I was optimistic when I found out that Ciao 4.0 and up is Python compatible. I thought I could do things in more statistically rigorous ways, since I could fake spectra to validate my fitting methods. I was thinking about modifying the indispensable chi-square method, which is used twice, for point estimation and for hypothesis testing, and thereby introduces bias (a link made to a posting). My goal was to make it less biased and more robust, that is, less sensitive to the iid Gaussian residual assumption. Contrary to my high expectations, I became frustrated at the first step: reading and playing with data to get a better sense and to develop a quick intuition. I couldn’t even take a baby step toward my goal. I’m not sure if it is a good thing or not, but I haven’t been completely discouraged. Also, time gradually helps to overcome this cultural difference, the lack of documentation.

What happens in general is that if a predecessor says to use “set_arf(),” then the apprentice will use “set_arf()” without a doubt. If you begin learning on your own, relying purely on documentation, I guess at some point you have to make a choice. One can make a lucky guess and move forward quickly. Sometimes one lands in a miserable situation, because one is not sure about the choice and cannot trust the features that appear after the processing. I guess it is natural to be curious about what each of these commands does to your data and what information is carried over to the next commands in the analysis procedure. It seems right to want to know which command is best for a particular data processing task and statistical inference, given the data. What I found is that such a comparison across commands is missing from the documentation. This is why I think astronomical data analysis systems are short of mercy for nonspecialists.

Another thing I observed is that there seems to be no documentation or standard procedure for creating repeatable data analysis results. From my observation of astronomers, given the same raw data, the results of scientists A and B differ (even beyond statistical margins). There are experts, and they have the knowledge to explain why results differ on the same raw data. However, not everyone has the luxury of consulting those few experts. I cannot understand such exclusiveness in place of standardizing the procedures through validation. I have even seen that the data A analyzed some years back can differ from this year’s when he/she writes a new proposal. I think that the time spent recreating the data processing and inference procedures in order to explain, justify, or validate the different results, or to narrow the gap, need not have been wasted if there were standard procedures and documentation for them. This is purely a statistician’s thought. As a comment in where is ciao X?^[5] says, not every data analysis system has to have a similar design and goals.

Getting lost while figuring out the basics (handling, arf, rmf, psf, and numerous case-by-case corrections) before applying even simple statistics has been my biggest obstacle in learning astronomy. The lack of documented validation methods often frustrates me. I wonder if there are any astronomers who got lost in learning statistics via R, Minitab, SAS, MATLAB, Python, etc. As discussed in where is ciao X?, I wish there were a centralized tutorial that offers the basics: how to read in data, how to manipulate data vectors and matrices, how to do arithmetic and error propagation adequately without violating statistical assumptions (I don’t like the fact that the point estimate of the background level is subtracted from the observed counts, a random variable, when the distribution does not permit such scale parameter shifting), how to handle and manipulate FITS format files from Chandra for various statistical analyses, how to do basic image analysis, how to do basic spectral analysis, and so on, with references.^[6]
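
The objection to subtracting a background point estimate from counts can be illustrated with a short simulation. This is a sketch under assumptions that are entirely mine: the observed counts are taken to be Poisson with mean equal to source plus background, the background rate is treated as known, and the two rates and the sample size are hypothetical. The Poisson sampler is Knuth's textbook method, written out so that the example needs only the standard library.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's method (fine for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


rng = random.Random(1)   # fixed seed so the run is repeatable
source_rate = 2.0        # assumed expected source counts
background_rate = 3.0    # assumed expected background counts, treated as known

observed = [poisson(source_rate + background_rate, rng) for _ in range(20000)]
net = [n - background_rate for n in observed]   # naive background subtraction

mean_net = sum(net) / len(net)
var_net = sum((v - mean_net) ** 2 for v in net) / (len(net) - 1)
frac_negative = sum(v < 0 for v in net) / len(net)

print(f"mean of net counts:     {mean_net:.2f}")
print(f"variance of net counts: {var_net:.2f}")
print(f"fraction negative:      {frac_negative:.3f}")
```

A Poisson variable has equal mean and variance and is never negative. Under the assumptions above, the subtracted counts keep the variance of the observed counts (about 5) while their mean drops to about 2, and some of them are negative, so treating them as Poisson counts afterward would violate the model.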

^[1] This is quite an overdue posting. Links and associated content may be outdated.
^[2] For classification purposes, data with a clear distinction between response and predictor variables, a so-called training data set, must be given. However, I often fail to get processed data sets for statistical analysis. I first spend time reading data and questioning what is an outlier, bias, or garbage. I’m not sure how to clean the data and extract numbers for statistical analysis, and every subfield in astronomy has its own way of cleaning data before feeding them into statistics and scatter plots. For example, image processing is still executed case by case via the trained eyes of astronomers. In medical imaging diagnosis, on the other hand, specialists offer training sets with which computer vision scientists develop classification algorithms. Such collaboration yields accelerated, automatic, but preliminary diagnostic tools. A small fraction of results from these preliminary methods can still be ambiguous, i.e. false positives or false negatives. When such ambiguous cancerous cell images occur at the decision boundaries, specialists, like trained astronomers, scrutinize those images to make a final decision. Since medical imaging and its classification algorithms have resolved the issue of an expert shortage under overflowing images, I wish astronomers would adopt their strategies to confront massive streaming images and to assist the few trained astronomers.
^[3] Something I would like to see is statistical handling of background in high energy astrophysics. When simulating a source, the background can be simulated as well via Markov random fields, kriging, and other spatial statistics methods. In reality, background is subtracted once in measurement space, and the random nature of the background is not reflected interactively. Regardless of the statistical methodology available to reflect the nature of background, it is difficult to implement it for trial and validation because those tools do not accommodate added statistical modules and packages.
^[4] A Sherpa expert told me there is an FAQ (which I had previously failed to locate) on this matter. However, from a data analysis perspective, like the distinction between data.frame, vector, matrix, list, and other data types in R, the description is not sufficient for someone who wants to learn ciao and to perform scientific (either deterministic or stochastic) data analysis via scripting, i.e. handling objects appropriately. You might want to read comparing commands in Sherpa from the Sherpa FAQ.
^[5] I know there is ciaox. Apart from the space between ciao and X, there is another difference that astronomers do not care about as much as statisticians do: the difference between X and x. Typically, a capital letter denotes a random variable and a lowercase letter an observation or value.
^[6] By the way, there are ciao workshop materials available that could function as tutorials. Please locate them if needed.

[MADS] Parallel Coordinates

hlee — Wed, 29 Jul 2009 06:02:18 +0000

Speaking of XAtlas from my previous post, I tried another visualization tool, called parallel coordinates, on these Capella observations and on two other stars with multiple observations (AR Lac and IM Peg). As discussed in [MADS] Chernoff face, a full description of the catalog is found on the XAtlas website. The reason for choosing these stars is that, among low mass stars, IM Peg (HD 21648, 8 times) and AR Lac (6 times, although at different phases) are the most frequently observed next to Capella (I showed 16). I was curious about which variation is dominant: within stars (statistical variation) or between stars (Capella, IM Peg, AR Lac). What would they look like in the parameter space of high resolution grating spectroscopy from Chandra?

With 13 X-ray line and/or continuum ratios, a typical data display would be the 13 choose 2 (78) pairwise scatter plots, as follows. Note that the upper left panels, with three colors, are drawn for classification purposes (red: AR Lac, blue: IM Peg, green: Capella), while the lower right ones are left uncolored for clustering analysis purposes. These scatter plots are essential to exploratory data analysis, but with this many panels they do not convey information efficiently. In astronomical journals, thanks to astronomers’ a priori knowledge, only a few pairs of important variables are selected and displayed, which reduces the visualization complexity dramatically. Unfortunately, I cannot select only the physically important variables.

I am not a knowledgeable astronomer, but I believe in reducing dimensionality according to the research objective. The goal is set by asking questions like “what do you want from this multivariate data set?”: classification (a classification rule or regression model that separates the three stars, Capella, AR Lac, and IM Peg), clustering (are the three stars naturally clustered into three groups, or is there a different number of clusters even if they are not clearly visible in the scatter plots above?), hypothesis testing (are they the same type of star or different?), point estimation and its confidence interval (means and their error bars), and variable selection (or dimension reduction). So far no statistical question is well defined (which can be a good thing for new discoveries). Prior to any confirmatory data analysis, we had better find a way to display this multidimensional data efficiently. I thought parallel coordinates served the purpose well, but surprisingly they have never been discussed in the astronomical literature; at least, they do not appear in ADS.

Each of the 13 variables was either normalized (left) or standardized (right). The parallel coordinate plot looks both simpler and more informative. The Capella observations occupy a region that is more separable than those of the other stars. It is easy to see that one Capella observation is an obvious outlier from the rest, which is hardly seen in the scatter plots. It is clear that discriminant analysis or classical support vector machine type classification methods cannot separate AR Lac and IM Peg. Clustering based on distance measures of dissimilarity also cannot reveal a natural grouping of these two stars, whereas the Capella observations form their own cluster. In my opinion, parallel coordinates provide more information about multidimensional data (dim>3), in a simpler way, than scatter plots do. They naturally show highly correlated variables within the observations of the same star or across all target stars. This insight from visualization is a key to devising methods of variable selection or dimension reduction for the data set.
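
For readers who want to reproduce the two scalings before drawing parallel coordinates, here is a small sketch. It assumes that "normalized" means min-max scaling to [0, 1] and that "standardized" means z-scores with the sample standard deviation; the post does not spell out either definition, and the tiny table of ratios below is invented, not taken from XAtlas.

```python
import math


def normalize(column):
    """Min-max scaling: map the column linearly onto [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


def standardize(column):
    """Z-scores: subtract the mean, divide by the sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]


# Hypothetical ratios: rows are observations, columns are variables.
rows = [
    [0.8, 12.0, 3.1],
    [1.1, 15.0, 2.9],
    [0.9, 11.0, 3.6],
    [1.4, 18.0, 3.0],
]
columns = [list(col) for col in zip(*rows)]
normalized = [normalize(col) for col in columns]
standardized = [standardize(col) for col in columns]

# Each observation becomes one polyline: its scaled value on every axis.
polylines = [list(obs) for obs in zip(*normalized)]
for line in polylines:
    print([round(v, 2) for v in line])
```

Each printed list is the sequence of heights at which one observation crosses the parallel axes; passing these to any line-plotting routine, one line per observation and one color per star, gives the plot described above.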

My personal opinion is that, for lack of an efficient and informative tool for visualizing complex high resolution spectra in many detailed metrics, smoothed bivariate (trivariate at most) information such as hardness ratios and quantiles is used instead for displaying X-ray spectral data. I’m not saying that parallel coordinates are the ultimate answer to visualizing multivariate data, but I’d like to emphasize that this method is a more informative, intuitive, and simple way to understand the structure of a relatively high dimensional data cloud.

Parallel coordinates have a long history. The earliest discussion I found was in the 1880s. They were popularized by Alfred Inselberg and gained recognition among statisticians through Edward Wegman (1990, Hyperdimensional Data Analysis Using Parallel Coordinates). Colorful images of the Sun, stars, galaxies, and their coronae, interstellar gas, and jets are the eye catchers. I hope that data visualization tools gain an equal share of the spotlight, since they summarize data and deliver lots of information. If images are well decorated cakes, then these tools from EDA are sophisticated and well baked cookies.

——————- [Added]
According to

[arxiv:0906.3979] The Golden Age of Statistical Graphics
Michael Friendly (2008)
Statistical Science, Vol. 23, No. 4, pp. 502-535

it is 1885. Not knowing French – if I knew I’d like to read Gauss’ paper immediately prior to anything – I don’t know what the reference is about.

different views

hlee — Mon, 13 Jul 2009 00:33:25 +0000

An email was forwarded with questions related to the data sets found in “Be an INTEGRAL astronomer”. Among the sets, the following scatter plot is based on the Crab data.

If you ignore the time predictor, it is hard to believe that this is a light curve, i.e. time dependent data. At a glance, this data set looks like a simple block design for a one-way ANOVA. ANOVA stands for Analysis of Variance, a term that is not familiar to astronomers.

Consider the case that you have a very long strip of land that experienced FIVE different geological phenomena. What you want to show is that the crop productivity of each piece of land is different. So, you make FIVE beds and plant the same kind of seeds. You measure the location of each seed from the origin. Each bed has some dozens of seeds, which are very close to each other although their distances differ. On the other hand, the planting beds are so far apart that one cannot say plants in test bed A affect plants in B. In other words, A and B are independent, which suits my statistical inference procedure by the F-test. All you need to do, after a few months, is to measure the total weight of crop yield from each plant (with measurement errors).

Now, let’s look at the plot above. If you replace distance with time and weight with flux, the pattern in data collection and its statistical inference procedure match the one-way ANOVA. Apart from the complication in statistical inference due to measurement errors, it’s hard to say this data set was designed for time series analysis. How to design a statistical study with measurement errors, huge gaps in time, and unequal time intervals is complex and unexplored. It depends highly on the choice of inference methods, assumptions on errors (i.e. likelihood function construction), prior selection, and distribution family properties.

Speaking of ANOVA, using the F-test means that we assume the residuals are Gaussian, so that one can comfortably modify the model with additive measurement errors. Here I assume there’s no correlation among measurement errors and plant beds. How to parameterize the measurement errors in the model depends on such assumptions, as does how to assess the sampling distribution and test statistics.

Although I know this Crab nebula data set is not meant for the one-way ANOVA, the pattern in the scatter plot drove me to test the data set. The output says to reject the null hypothesis of statistically equivalent flux in the FIVE time blocks. The following is the R output without measurement errors.

           Df  Sum Sq  Mean Sq  F value     Pr(>F)
factor      4  4041.8   1010.4   143.53  < 2.2e-16 ***
Residuals 329  2316.2      7.0
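For readers who have not met an ANOVA table before, the following Python sketch shows how the F value is built from sums of squares. The function name and the toy groups are hypothetical; only the last few lines use numbers from the table above, and they simply redo the arithmetic of that table rather than reanalyze the Crab data.

```python
def one_way_anova(groups):
    """Return (ss_between, df_between, ss_within, df_within, f_value)."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        group_mean = sum(g) / len(g)
        # spread of the block means around the grand mean
        ss_between += len(g) * (group_mean - grand_mean) ** 2
        # spread of the observations inside each block
        ss_within += sum((v - group_mean) ** 2 for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, df_between, ss_within, df_within, f_value

# Toy example with three hypothetical blocks
toy = one_way_anova([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
print(toy)

# Redo the arithmetic of the R table: F = (Sum Sq / Df) of factor over residuals
f_from_table = (4041.8 / 4) / (2316.2 / 329)
print(round(f_from_table, 2))
```

Five blocks give 4 degrees of freedom for the factor, and 334 observations minus 5 block means give the 329 residual degrees of freedom seen in the table.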

If the gaps were minor, I would next consider time series with missing data. However, the missing pattern does not agree with my knowledge of missing data analysis. I wonder how astronomers handle such big gaps in time series data, what assumptions they make to get a best fit and its error bar, how the measurement errors are incorporated into the statistical model, what the objective of statistical inference is, how to relate physical meanings to statistically significant parameter estimates, how to assess whether the model choice is proper, and more. When the contest is over, if available, I’d like to check out any statistical treatments that answer these questions. I hope there are scientists who consider similar statistical issues in these data sets from the INTEGRAL team.

how to trace?

hlee — Thu, 11 Jun 2009 20:52:57 +0000

I was at the SUSY 09 public lecture given by a Nobel laureate, Frank Wilczek of QCD (quantum chromodynamics). As far as I know, SUSY is the abbreviation of SUperSYmmetry in particle physics. Finding such antimatter(? I’m afraid I read “Angels and Demons” too quickly) will explain the unification theory among electromagnetic, weak, and strong forces and even gravitation, according to the speaker’s graph. I’ll not go into the details of particle physics and the standard model. The reason is too obvious. Instead, I’d like to show this image from wikipedia and to discuss my related questions.

Whenever the LHC (Large Hadron Collider, several posts from the slog) is publicly advertised, the grand scale of the accelerator (about 27km) is the center of attention for these unprecedented controlled experiments in particle physics research. I mean “controlled” in the sense of factorization in statistical experimental design, which eliminates unknowns and factors in external components (covariates, for example). By the same token, not the grand scale of the accelerator but the detector, the controlled/isolated system, and its design for collecting data seem most important to me. Without searching for reports, I want to believe that countless efforts have been put into detectors and data processors, which seem to be overshadowed by the grand scale of the accelerating tube.

For fun and honoring the speaker’s showing it to the public, you might like to see this youtube rap again.

As a statistician curious about the detector and the physics leading the designs of such expensive and extreme studies, I was more interested in knowing more about

how data are collected and
how study was designed or what are the hypotheses

not the scale of the accelerator nor the feeling inside the 2 degree vacuum tube. The public lecture offered no clue to even partial answers to these questions. So, I hope some slog readers can help me better understand the following issues spawned from this public lecture. Let me state my questions statistically and try to associate them with the image above.

Uncertainty Principle

The uncertainty principle by physicists is written roughly as follows:

Δ E Δ t > h

where h is Planck’s constant. Instead of energy and time, location and momentum are used as well: Δ x Δ p > h. This principle is more or less related to precision or bias. One cannot measure things with 100% precision. In other words, in measuring quantities from physics, there is no exact unbiased estimator (asymptotic unbiasedness is a different context). In order to observe a subparticle on a short time scale, the energy must be high. Yet, unless the energy is extremely high, the uncertainty of when the event happens is so huge that no one can assign exact numbers to when the event happens. This uncertainty principle is the primary reason for such a large accelerator: particles can gain tremendous energy and, therefore, an observer can determine the location and the time of the event (collision, subsequent annihilation, and scatters of subparticles) within the uncertainty set by the principle of physics.

What is Uncertainty?

I’ve always had conflicting notions about uncertainty in statistics and astronomy: the uncertainty from Heisenberg’s uncertainty principle, and the uncertainty from measure theory and the stochastic nature of data. Although the word is the same, the implications are different. The former describes precision as discussed above and the latter accuracy (Bevington’s book describes the difference between precision and accuracy, if I recall correctly). When an astronomer has data and computes a best fit and one σ from the chi-square, that σ quantifies the uncertainty/scale of the Gaussian distribution, a model for residuals that the astronomer has chosen for fitting the data with the model of physics.

When it comes to measurement errors, it’s more like discussing precision, not accuracy or the scale parameters of distribution functions (families of distributions). Whether for measurement errors or for computing uncertainty via chi-square minimization or Bayesian posterior distribution estimation, most procedures for understanding uncertainties in the astronomical literature are based on parametrizing uncertainties. Luckily, we know that Gaussian and Poisson distributions work as parametrizations in almost all cases in astronomy. Yet, my understanding is that there’s not much distinction between precision and accuracy in astronomical data analysis, and not much awareness of the difference between the uncertainty principle from physics and the uncertainty from the stochastic nature of data. This seems to cause biased or underestimated results. In the jargon of statistics, the issues of model mis-identification and model uncertainty^[1] from other disciplines are worth looking into, instead of being overlooked, to narrow the gap.

As a statistician, I approach the problem of uncertainty hierarchically. Start from the simplest case, in which sigma is known, and use the given sigma as the ground truth. If the statistics do not support such a condition, then move toward estimating it, testing whether the error is homogeneous or heterogeneous, etc., to understand the sampling distribution better and devise statistics accordingly. During the procedure, I’ll add a model for measurement errors. If they are Gaussian, adding statistical uncertainty terms and measurement error terms works well, being an easy convolution of Gaussian distributions (see my why gaussianity?). I might have to ignore some factors in my hierarchical modeling procedure if their contribution is almost none and the hierarchical model becomes too complicated for such a mediocre gain. Instead, it would be easy to follow the rule of thumb strategies developed by astronomers with great knowledge and experience. Anyway, if a parametric strategy does not work, I’ll employ nonparametric approaches. Focusing on Bayesian methodology, it’s like modeling hierarchically from parametric likelihoods and informative, subjective priors to nonparametric likelihoods and objective, noninformative priors. Overall, these are efforts to model both physics and errors assuming that measurements are taken accurately; multiple measurements and collecting many photons quantify how accurately the best fit is obtained. On the other hand, under the uncertainty principle, intrinsic measurement bias (unknown but bounded) is inevitable. Not statistics but physics can tell how precisely measurements can be taken. It’s still uncertainty, but of a different kind. I sometimes encounter astronomers mixing strategies for calibrating uncertainties of different grounds, and I also get confused and lost.
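The “easy convolution of Gaussian distributions” can be checked numerically. The sketch below assumes independent, zero-mean Gaussian terms for the statistical scatter and the measurement error; the sigma values and the function name are hypothetical. Under those assumptions the sum is again Gaussian and the variances add, so the sigmas combine in quadrature.

```python
import math
import random
import statistics

def combined_sigma(sigma_stat, sigma_meas):
    """Sigma of the sum of two independent Gaussian terms (variances add)."""
    return math.hypot(sigma_stat, sigma_meas)

# Hypothetical scales for statistical scatter and measurement error
sigma_stat, sigma_meas = 3.0, 4.0
expected = combined_sigma(sigma_stat, sigma_meas)

# Monte Carlo check: add one draw of each error term, many times
rng = random.Random(42)
draws = [rng.gauss(0.0, sigma_stat) + rng.gauss(0.0, sigma_meas)
         for _ in range(100_000)]
empirical = statistics.pstdev(draws)

print(expected, round(empirical, 2))
```

If the two terms were correlated, or if either were not Gaussian, this simple quadrature rule would not be enough, which is where the hierarchical modeling described above comes in.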

I’d like to say that multiple observations (the number of degrees of freedom in chi-square minimization, and bins in histograms) are realizations of the coupling of bias and variance (precision and accuracy; measurement errors and statistical uncertainty in sigma/error bar). Because of this coupling, the importance of proper parametrization and regularized optimization, instead of simple least squares or chi-square, cannot be emphasized enough for getting the right 68% coverage of the uncertainty in a best fit. Statisticians often discuss the mean square error (see my post [MADS] Law of Total Variance) rather than the error bar to account for the overall uncertainty in a best fit.
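The reason the mean square error accounts for “overall” uncertainty is the identity MSE = bias^2 + variance. A small Python sketch, with hypothetical repeated estimates of a quantity whose true value is assumed known, verifies it numerically (the variance here is the population variance of the estimates):

```python
import statistics

def mse_decomposition(estimates, truth):
    """Return (mse, bias, variance) for repeated estimates of a known truth."""
    mean_est = statistics.fmean(estimates)
    bias = mean_est - truth                      # systematic offset
    variance = statistics.pvariance(estimates)   # scatter around their own mean
    mse = statistics.fmean([(e - truth) ** 2 for e in estimates])
    return mse, bias, variance

# Hypothetical best-fit values from five repeated measurements; truth assumed 10.0
estimates = [9.8, 10.4, 10.1, 10.6, 10.1]
mse, bias, variance = mse_decomposition(estimates, 10.0)
print(round(mse, 4), round(bias ** 2 + variance, 4))
```

An error bar reports only the variance part; a biased procedure can have a small error bar and still a large mean square error.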

I’m afraid that my words sound like gibberish. I hope that statisticians with a good command of literary and scientific language will discuss the uncertainty of physics and of statistics and how it affects choosing statistical methods and drawing statistical inference from (astro)physical data. I’m also afraid that people continue going for one sigma by feeding the data into the chi-square function and adding speculated systematic errors (say 15% of the computed sigma from the chi-square minimization) without second thoughts on the implications of uncertainty and on the assumptions behind its quantification methods.

Identifiability

I wonder how the shot in the above image is taken when protons are colliding. There should be a tremendous number of subparticles generated from the collisions of many protons. Unless there is a single photo frame that captures the traces of all those particles (do collisions happen in a 3D camera chamber? Perhaps they use medical imaging or tomography techniques, but considering processing time I doubt its feasibility), I think those traces are a reconstruction from multiple cross sectional shots. My biggest concern was how each line and dot you see in the picture can be associated with a certain particle. Physics and the standard model can tell that their trajectories are distinguishable, depending on their charges, types, and masses, but there are, say, millions of events happening on an extremely short time scale! How certainly can one say that this is the trace of a certain particle?

The speaker discussed massive data and uncertainty as another challenge. Many procedures in terms of (statistical) data analysis seem not to be explored yet, although the theory of physics is very sophisticated and complicated. If physics is a deductive/deterministic science, then statistics is inductive/stochastic. I personally believe that theories are able to conclude the same from both physical and statistical experiments. I guess now it’s time to prove such a thesis with data and statistics, and it starts with identifying particles’ traces and their meta-data.

image reconstruction

To create an image of many particles as above when we have the identifiability issue and the uncertainty in time and space, I wonder how pictures are constructed from each collision. The lecturer used an analogy of a dodecahedron calendar with missing months to convey the feeling of image reconstruction in particle physics. Whenever I see such images of many ray traces and hear the promises of what the LHC will deliver, I wonder how they reconstruct those traces after the particle collisions and measure the times of events. Thanks to the uncertainty principle and its mediocre scale, there must be some tremendous constraints and missing data. How much information is contained in that reconstructed image? How much information loss is inevitable due to those constraints? It would be very interesting to know each step from detectors to images and to find the statistical and information theoretical challenges.

massive data processing

Colliding one proton with another seems ideal for discovering the unification theory advocating the standard model, by tracing a relatively small number of individual particles. If so, the picture above could have been simpler than it looks. Unfortunately, that’s not the case, and a huge number of protons are sent for collision. I’ve kept hearing about the gigantic size of the data that particle physics experiments create. I wondered how such massive data are processed while the speaker showed a picture of one of the world’s best computing facilities at CERN. Not just automated pipelining but also processing, cleaning, summarizing, and evaluating from statistical aspects would require clever algorithms to make the most of those multiple processors.

hypothesis testing

I still think that quests for particles via the LHC are classical decision theoretic hypothesis testing problems: the null hypothesis is no new (unobserved) particle vs. the alternative hypothesis containing the model/information of the new particle given by the theory (SUSY, antimatter, etc.). Statistically speaking, in order to observe such matter or to reject the null hypothesis comfortably, we need statistically powerful tests, where the Neyman-Pearson test/construction is often mentioned. One needs to design an experiment that is powerful enough (power here has two connotations: one is physically powerful enough to give protons high energy so that one can observe particles in the brief time and space frame, and the other is statistically powerful such that if such new particles exist, the test rejects the null hypothesis with decent power and false discovery rate). How to transcribe data and models into a powerful test still seems an open question to physicists. You can check discussions from the links in the PHYSTAT LHC 2007 post.

source detection

In a similar context to source detection in astronomy, how do physicists define and segregate the source (a particle of interest, the Higgs, for example) from the background? It’s also related to the identifiability of particles shown in the picture. How can a physicist see a rare event among tons of background events, which form a wide sampling distribution or, in other words, have a huge uncertainty as an ensemble? Also, the source event has its own uncertainty because of the uncertainty principle. How to form robust thresholding methods? How to develop Bayesian learning strategies for better detection? Perhaps the underlying (statistical) models are different for particle physics and for astronomy, but the basic idea of how to apply statistical inference seems not much different, given that 1. the background can be more dominant, 2. the background is used for the null hypothesis, and 3. the source distribution comprises the alternative distribution. It’ll be very interesting to collect statistics for source detection and formalize those methods so that consistent source detection results can be achieved by devising statistics that suit the data types.
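As a toy version of points 2 and 3, consider a single counting experiment in which the background alone is Poisson with a known mean (the null hypothesis) and source plus background is Poisson with a larger mean (the alternative). The Python sketch below finds the count threshold for a chosen false alarm rate and the power of the resulting test. All numbers and function names are hypothetical, and real detection pipelines in either field are far more involved than this.

```python
import math

def poisson_tail(n, mu):
    """P(N >= n) for N ~ Poisson(mu), summing the lower tail term by term."""
    term = math.exp(-mu)  # P(N = 0)
    cdf = 0.0
    for k in range(n):
        cdf += term
        term *= mu / (k + 1)  # P(N = k + 1) from P(N = k)
    return max(0.0, 1.0 - cdf)

def detection_threshold(background, alpha):
    """Smallest count n such that P(N >= n | background only) <= alpha."""
    n = 0
    while poisson_tail(n, background) > alpha:
        n += 1
    return n

background = 3.0   # hypothetical expected background counts (null)
source = 10.0      # hypothetical expected source counts (alternative adds this)
alpha = 0.001      # tolerated false alarm probability

threshold = detection_threshold(background, alpha)
power = poisson_tail(threshold, background + source)
print(threshold, round(power, 3))
```

Lowering alpha raises the threshold and lowers the power for a fixed source strength, which is the trade-off behind any thresholding method.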

cliche and irony

I’d like to quote two phrases from the public lecture.

finding an atom in a haystack

A cliche for all groups of scientists. I’ve heard “finding a needle in a haystack” so many times because of the new challenges that we face in the information era. On the other hand, replacing “needle” with “atom” was new to me. Unfortunately, my impression is that physicists are not equipped with tools for such data mining, whether for a needle or an atom. I wonder what computer scientists can offer them for this more challenging quest to answer the fundamental question about the universe.

it’s an exciting time to be physicists

The speaker said physicists, but I’ve heard the same sentence from astronomers and statisticians with their own professions in place of physicists. After hearing it too often from various people, I became doubtful since I cannot feel such excitement directly. It feels like, after hearing about change too often before it happens, one cannot feel the real progress of change. Words always travel faster than actions. Sometimes words can be just empty promises. That’s why I thought it’s a cliche and irony. Perhaps it’s due to the fact that I’m at the intersection of these sets of scientists, not at the center of any set. Ironically, defining boundaries is also fuzzy nowadays. Perhaps I’m already excited and afraid of transiting down to a lower energy level. Anyway, being enthusiastic and living in an exciting time seem to be different matters.

What will come next?

I haven’t heard any news about Phystat 2009; the previous meetings occurred every odd year in the 21st century. Personally, I found their meeting agendas and subsequent proceedings very informative, and they offered clues to my questions. I hope the next meeting will be held soon.

[1] The notions of model uncertainty among astronomers and statisticians are different. Hopefully, I will have time to talk about it.

[ArXiv] Sparse Poisson Intensity Reconstruction Algorithms

hlee — Thu, 07 May 2009 16:14:39 +0000

One of the [ArXiv] papers from yesterday has a title that might draw lots of attention from astronomers. Furthermore, it’s a short paper.
[arxiv:math.CO:0905.0483] by Harmany, Marcia, and Willett.

Estimating f under the “Sparse Poisson Intensity” condition is a frequently appearing topic in high energy astrophysics data analysis. Some might like to check the references in the paper, which offer solutions to compressed sensing problems with different kinds of sparsity, minimization approaches, and constraints on f.

Apart from the technical details, the first two sentences from the conclusion,

We have developed computational approaches for signal reconstruction from photon-limited measurements – a situation prevalent in many practical settings. Our method optimizes a regularized Poisson likelihood under nonnegativity constraints

tempt me to study and try their algorithm.

[MADS] Chernoff face

hlee — Thu, 02 Apr 2009 16:00:41 +0000

I cannot remember when I first met the Chernoff face, but it hooked me instantly. I always hoped to encounter multivariate data from astronomy to which this charming EDA method could be applied. Then, somehow, that eagerness faded without my realizing what was happening. Tragically, this was mainly due to my absent mind.

After meeting Prof. Herman Chernoff unexpectedly (I didn’t know he is Professor Emeritus at Harvard), the urge revived, but I still didn’t have data then. Alas, another bout of absent-mindedness: I don’t understand why I didn’t realize until today that I already have the data, XAtlas, for trying Chernoff faces. The data and their full description are found on the XAtlas website (click). For Chernoff faces, the references suggested in Wiki:Chernoff face are good. I believe some folks are already familiar with Chernoff faces from a New York Times article last year, listed in Wiki (or a subset characterized by baseball lovers?).

Capella is an X-ray bright star observed multiple times for Chandra calibration. I listed 16 ObsIDs in the figures below, one at each face, among 18+ Capella observations (last time I checked the Chandra Data Archive, 18 Capella observations were available). These 16 are high resolution observations from which various metrics like interesting line ratios and line to continuum ratios can be extracted. I was told that optically it’s hard to find any evidence that Capella experienced catastrophic changes during the Chandra mission (about 10 years in orbit), but the story in X-ray can be very different. Even in such a dismally short time period (10 years for a star is a flash or less), Capella could have revealed short time scale high energy activities via Chandra. I just wanted to illustrate that Chernoff faces could help visualize such changes or any peculiarities through interpretation friendly facial expressions (studies have confirmed babies’ ability to recognize facial expressions). So, what do you think? Do the faces look similar/different to you? Can you offer me astronomical reasons for why a certain face (ObsID) is different from the rest?

p.s. In order to draw these Chernoff faces, check the descriptions of the R functions faces() (yields the left figure) and faces2() (yields the right figure) by clicking on the function of your interest. There are other variations, and other data analysis systems offer differently fashioned tools for drawing Chernoff faces to explore multivariate data. Any requests for plots in pdf are welcome. These jpeg files look too coarse on my screen.

p.p.s. The variables used for these faces are line ratios and line to continuum ratios. The order of these input variables could change the countenance, but impressions from the faces will not change (a face with distinctive shapes will look different from other faces even after the order of metrics/variables is scrambled or when different Chernoff face illustration tools are used). The mapping, say from an astronomical metric to the length of lips, was not studied in this post.

p.p.p.s. Some data points are statistical outliers, and I’m not sure how to explain the strange numbers (unrealistic values for line ratios). I hope astronomers can help me interpret those peculiar numbers in line/continuum ratios. My role is to show that statistics can motivate astronomers toward new discoveries and to offer different graphics tools for enhancing visualization. I hope these faces motivate some astronomers to look into Capella in XAtlas (and beyond) in detail through different spectacles, and to find out the reasons for the different facial expressions in Capella X-ray observations. In particular, ObsID 1199 is the most questionable to me.

4754 d.f.

hlee — Tue, 17 Mar 2009 19:37:44 +0000

I couldn’t believe my eyes when I saw 4754 degrees of freedom (d.f.) and a chi-square test statistic of 4859. I’ve seen large degrees of freedom often enough in astronomy journals, several hundreds to a few thousands, but I never felt comfortable with these big numbers. Then, with a great shock, 4754 d.f. appeared. I must find out why I feel so bothered by these huge degrees of freedom.

When I was learning statistics, I never encountered such huge degrees of freedom. Well, given the facts that only a small amount of time is spent on learning the chi-square goodness-of-fit test, that the chi-square distribution is a special case of the gamma distribution, and that statisticians do not handle a hundred thousand photons (there are more low count spectra, but I’ll discuss later why I chose this big number) from X-ray telescopes, almost surely no statistician would encounter such huge degrees of freedom.

Degrees of freedom in spectral fitting are the combined result of binning (or grouping into n classes) and the number of free parameters (p), i.e. n-p-1. The parameters of interest, the targets to be optimized or solved for, come from physical source models, which are determined by the laws of physics. There is nothing to discuss from the statistical point of view about these source models except model selection and assessment, which seems to be an almost unexplored area. On the other hand, I’d like to know more about binning and the subsequent degrees of freedom.

A few binning schemes that I often see in spectral analysis require each bin to have more than 25 counts (the same notion as 30 in statistics for the CLT, or the last number in a t-table) or the counts in each bin to satisfy a certain signal to noise ratio (S/N) level. The latter is equivalent to requiring that sqrt(expected counts) be larger than the given S/N level, since photon counts are Poisson distributed. There are more sophisticated adaptive binning strategies, but I haven’t found mathematical, statistical, or computational algorithmic justifications for them. To me they look like empirical procedures discovered after much trial and error on particular types of spectra (I often wonder whether I can reproduce the same goodness of fit results with the same ObsIDs as reported in those publications). The point is that, simple or complex, in the end, if someone has a data file with a large number of photons, n is generally larger than for observations with sparse photons. This is the reason I happen to see d.f.s inconceivable to a statistician, like 4754, in some papers.
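The simple minimum count scheme can be sketched in a few lines of Python. This is only an illustration of the idea, not the algorithm of any particular spectral fitting package: the function name is hypothetical, the channel counts are made up, and merging leftover channels into the last group is one of several possible conventions.

```python
def group_min_counts(counts, min_counts=25):
    """Merge adjacent channels left to right until each group has >= min_counts.

    Returns a list of (channel_indices, total_counts) tuples.
    """
    groups = []
    current, total = [], 0
    for i, c in enumerate(counts):
        current.append(i)
        total += c
        if total >= min_counts:
            groups.append((current, total))
            current, total = [], 0
    if current:
        # Leftover channels never reached the minimum: fold them into the
        # last complete group (an assumed convention; tools differ here).
        if groups:
            last_idx, last_total = groups.pop()
            groups.append((last_idx + current, last_total + total))
        else:
            groups.append((current, total))
    return groups

# Hypothetical counts per detector channel
channel_counts = [10, 20, 5, 3, 30, 12, 1]
grouped = group_min_counts(channel_counts, min_counts=25)
print(grouped)

# An S/N >= 5 requirement on Poisson counts is the same rule with 5**2 counts
grouped_snr = group_min_counts(channel_counts, min_counts=5 ** 2)
```

Note that the number of groups n produced this way grows with the total number of photons, which is exactly how the degrees of freedom end up in the thousands for bright sources.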

First, the chi-square goodness of fit test was designed for agricultural data (or biology, considering Pearson’s eugenics) where the sample size is not on the scale of tens of thousands. Please note that a bin in astronomy is called a cell (class, interval, partition) in statistical papers and books showing applications of chi-square goodness of fit tests.

I’d also like to point out that the chi-square goodness of fit test is different from chi-square minimization even if they share the same equation. The former is for hypothesis testing and the latter is for optimization (best fit solution). Using the same data for optimization and testing introduces bias. That’s one of the reasons why, with a large number of data points, cross validation techniques are employed in statistics and machine learning^[1]. Since I consider binning as smoothing, the optimal number of bins and their sizes depend on data quality and source model properties, as is done in kernel density estimation or, more directly, in various versions of chi-square tests or distance based nonparametric tests (K-S test, for example).

Although it was published many decades ago, you might want to check this paper out to get a proper rule of thumb for the number of bins:
“On the choice of the number of class intervals in the application of the chi square test” (JSTOR link) by Mann and Wald in The Annals of Mathematical Statistics, Vol. 13, No. 3 (Sep., 1942), pp. 306-317, where they showed that the number of classes is proportional to N^(2/5) (the underlying idea of the chi-square goodness of fit test, the detailed derivation, and the exact equation for the number of classes are given in the paper), and this is the reason why I chose a spectrum of 10^5 photons at the beginning. Ignoring the other factors in the equation, 10^5 counts roughly yield 100 bins. About 4000 bins imply more than a billion photons, which seems an unthinkable number in X-ray spectral analysis. Furthermore, many reports said Mann and Wald’s criterion results in too many bins and a loss of power. So, n should be smaller than 100 for 10^5 photons.
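The two back-of-the-envelope numbers above follow from the N^(2/5) scaling alone. The Python sketch below deliberately drops the constant of proportionality, as the text does, so it reproduces only the order of magnitude and not Mann and Wald’s exact formula; the function names are hypothetical.

```python
def bins_from_counts(n_photons):
    """Number of classes under the bare N^(2/5) scaling (constant ignored)."""
    return n_photons ** (2.0 / 5.0)

def counts_from_bins(n_bins):
    """Inverse of the scaling: photons implied by a given number of classes."""
    return n_bins ** (5.0 / 2.0)

bins_for_1e5 = bins_from_counts(1e5)
photons_for_4000_bins = counts_from_bins(4000)

print(round(bins_for_1e5))             # bins suggested for 10^5 photons
print(f"{photons_for_4000_bins:.2e}")  # photons implied by about 4000 bins
```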

The other issue with statistical analysis of X-ray spectra is that, although photons in each channel/bin can be treated as independent samples, the expected numbers of photons across bins are related via the physical source model, or the so-called link function, to borrow a term from generalized linear models. However, the well studied link functions in statistics do not match the source models in high energy astrophysics. Typically, source models are not analytical. They are non-linear, numerical, tabulated, or of black box type, and thus incompatible with the current link functions in generalized linear models, a well developed, diverse, and robust subject in statistics for inference problems. Therefore, binning data and chi-square minimization seems to be the only strategy so far for statistical inference about parameters in source models (for some “specific” statistical or physical models this is not true, which is not a topic of this discussion). Mann and Wald’s method for class size assumes equiprobable bins, whereas channel or bin probabilities in astronomy would not satisfy this condition. The probability vector of the multinomial distribution depends on binning, detector sensitivity, and the source model instead of the equiprobable constraint from statistics. Well, it is hard to devise a purely statistically optimal binning/grouping method for X-ray spectral analysis.

Instead of individual group/bin dependent smoothing (S/N>3 grouping, for example), I nevertheless wish to develop binning/grouping schemes based on the total sample size N, particularly when N is large. I’m afraid that with the current chi-square test embedded in data analysis packages, the power of the chi-square statistic is so small that one will always get a good reduced chi-square value (astronomers’ simple model assessment tool: the chi-square statistic divided by the degrees of freedom, whose expected value is one. If the reduced chi-square is close to one, then the chosen source model and the solution for the parameters are considered the best fit model and values). The fundamental idea of a suitable number of bins is equivalent to the optimal bandwidth problem in kernel density estimation, whose objective is accentuating the information via smoothing; therefore, methodology developed in the field of kernel density estimation may suggest how to bin/group a spectrum while preserving most of the information and increasing efficiency. A modified strategy for binning and for applying the chi-square test statistic to assess model adequacy should be conceived instead of reporting thousands of degrees of freedom.
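For the numbers quoted at the top of this post (chi-square 4859 with 4754 d.f.), the reduced chi-square and a rough tail probability can be worked out as follows. The sketch uses the large d.f. normal approximation to the chi-square distribution (mean k, variance 2k), which is an approximation chosen here for illustration and not necessarily what the original paper computed; the function names are hypothetical.

```python
import math

def reduced_chi_square(chi2, dof):
    """Chi-square statistic divided by its degrees of freedom."""
    return chi2 / dof

def upper_tail_normal_approx(chi2, dof):
    """Approximate P(X >= chi2) for X ~ chi-square(dof), valid for large dof.

    Uses X ~ Normal(dof, 2 * dof); returns (z score, tail probability).
    """
    z = (chi2 - dof) / math.sqrt(2.0 * dof)
    p = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p

chi2, dof = 4859.0, 4754
reduced = reduced_chi_square(chi2, dof)
z, p_value = upper_tail_normal_approx(chi2, dof)
print(round(reduced, 3), round(z, 2), round(p_value, 2))
```

A statistic about one standard deviation above its expectation is unremarkable, so the fit passes the usual reduced chi-square check; whether a test with so many cells has any power against a wrong source model is the separate question raised above.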

I think I must quit before getting too bored. I’d only like to mention some quite interesting papers that cited Mann and Wald (1942) and explored the chi square goodness of fit, including Johnson’s A Bayesian chi-square test for Goodness-of-Fit (a link is made to the arxiv pdf file), which might appeal more to astronomers who like to modify their chi-square methods in a Bayesian way. A chapter, “On the Use and Misuse of Chi-Square” (link to google book excerpt) by KL Delucchi in A Handbook for Data Analysis in the Behavioral Sciences (1993), reads quite intriguingly, although the discussion is a reminder for behavioral scientists.

Lastly, I’m very sure that astronomers have explored the properties of the chi-square statistic and chi-square type tests with their data sets. I admit that I didn’t make an expedition for such works since they are a few needles in a haystack. I’d be delighted to see an astronomers’ version of “use and misuse of chi-square,” or a statistical account of whether the chi-square test with huge degrees of freedom is powerful enough; any advice on that matter will be very much appreciated.

[1] A rough sketch of cross validation: split the data into a training set and a test set. Get the best fit from the training set and evaluate the goodness-of-fit of that best fit with the test set. Alternate the training and test sets and repeat. wiki:cross_validation
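The rough sketch in this footnote can be turned into a small Python example. The choices below are assumptions made for illustration only: a straight line stands in for the source model, least squares for the best fit, mean squared error on the held-out fold for the goodness-of-fit, and the data are simulated.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def cross_validate(xs, ys, k=5, seed=0):
    """Fit on k-1 folds, score on the held-out fold, rotate, and repeat."""
    folds = k_fold_indices(len(xs), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        slope, intercept = fit_line([xs[j] for j in train_idx],
                                    [ys[j] for j in train_idx])
        errors = [(ys[j] - (slope * xs[j] + intercept)) ** 2 for j in test_idx]
        scores.append(sum(errors) / len(errors))  # held-out mean squared error
    return scores

# Simulated data: a line with Gaussian noise of sigma 0.5
rng = random.Random(1)
xs = [float(i) for i in range(40)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.5) for x in xs]
scores = cross_validate(xs, ys, k=5)
print([round(s, 3) for s in scores])
```

Because every point is scored by a fit that never saw it, the held-out scores avoid the optimism that comes from using the same data for optimization and testing.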