The AstroStat Slog » Data Processing
http://hea-www.harvard.edu/AstroStat/slog
Weaving together Astronomy + Statistics + Computer Science + Engineering + Instrumentation, far beyond the growing borders

[AAS-HEAD 2011] Time Series in High Energy Astrophysics (vlk, Fri, 09 Sep 2011)
http://hea-www.harvard.edu/AstroStat/slog/2011/head2011/

We organized a Special Session on Time Series in High Energy Astrophysics: Techniques Applicable to Multi-Dimensional Analysis on Sep 7, 2011, at the AAS-HEAD conference in Newport, RI. The talks presented at the session are archived at http://hea-www.harvard.edu/AstroStat/#head2011

A tremendous amount of information is contained within the temporal variations of various measurable quantities, such as the energy distributions of the incident photons, the overall intensity of the source, and the spatial coherence of the variations. While the detection and interpretation of periodic variations is well studied, the same cannot be said for non-periodic behavior in a multi-dimensional domain. Methods to deal with such problems are still primitive, and any attempts at sophisticated analyses are carried out on a case-by-case basis. Some of the issues we seek to focus on are:
* Stochastic variability
* Chaotic and quasi-periodic variability
* Irregular data gaps/unevenly sampled data
* Multi-dimensional analysis
* Transient classification

Our goal is to present some basic questions that require sophisticated temporal analysis in order for progress to be made. We plan to bring together astronomers and statisticians who are working in many different subfields, so that an exchange of ideas can occur and motivate the development of sophisticated, generally applicable algorithms for astronomical time series data. We will review the problems and issues with current methodology from algorithmic and statistical perspectives, and then look for improvements or for new methods and techniques.

some python modules (hlee, Fri, 13 Nov 2009)
http://hea-www.harvard.edu/AstroStat/slog/2009/python-module/

I was told to stay away from Python, and I’ve obeyed the order sincerely. However, I collected the following material several months back, prompted by hearing about import inference, and I hate to see it become obsolete. At the time, collecting these modules and working through them helped me take the first step toward the quest Learning Python (the first posting of this slog).

As you already know, there are quite a few websites dedicated to Python. Some of them speak only to astronomers. A tiny fraction of those websites are for statisticians, but I haven’t met any statistician who prefers only Python; we take the gist of various languages. So I’ll leave general website aggregators, such as AstroPy (which I think is extremely useful for astronomers), to enrich your bookmarks under the “python” tab regardless of your profession. Instead, I’ll discuss some Python libraries and modules that can be useful for those exercising astrostatistics and can make their work easier. I must say that I intentionally omitted a few modules because I was not sure about their public availability and copyright sensitivity. If you have modules that can be introduced publicly, let me know; I’ll be happy to add them. If my description is improper and you want it taken off, also let me know.

Over the past few years, Python became the most common and versatile scripting language for both communities, and therefore, I believe, it will accelerate many collaborations. Much of my time is spent finding out how to read, maneuver, and handle raw data and images. Most of the tactics astronomers use are quite unfamiliar, and sometimes insensible, to me (see my read.table() and data analysis system and its documentation). Somehow one scripting language, Python, thanks to its openness and free availability to all communities, promises to narrow that gap and foster prosperous and efficient collaborations.

The first posting on this slog was about Python. I thought that kicking off with a computer language that is relatively new and open to many communities could motivate me and others toward more interdisciplinary work with diversity. After a few years, unfortunately, I haven’t achieved that goal. Yet I still think that the libraries and modules introduced below will be useful for your transition from other programming languages, or for writing your own pro bono wrapper for better communication with others.

I’ll take numpy, scipy, and RPy for granted. For plotting, matplotlib seems the most common choice.
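As a minimal sketch of that baseline stack (nothing here comes from the original post; the simulated light curve and file name are invented for illustration):

    import numpy as np
    import matplotlib.pyplot as plt

    # simulate a noisy periodic light curve (made-up numbers)
    t = np.linspace(0, 10, 200)
    flux = 1.0 + 0.3 * np.sin(2 * np.pi * t / 3.0) + np.random.normal(0, 0.05, t.size)

    plt.plot(t, flux, "k.")
    plt.xlabel("time")
    plt.ylabel("flux")
    plt.savefig("lightcurve.png")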

Reading astronomical data (click links to download libraries, modules, and tutorials)

  • First, start with Using Python for Interactive Data Analysis (in PDF), a quite useful manual, particularly for IDL users. It compares the pros and cons of Python and IDL.
  • IDLsave: simply put, without IDL, a .save file becomes legible. This is a brilliant little module (see the reading sketch after this list).
  • PyRAF (I was really frustrated with IRAF and spent many sleepless nights. Apart from data reduction, I don’t remember much statistics from IRAF except simple statistics for Gaussian populations. I guess PyRAF does a better job.) And there’s PyFITS for handling FITS-format data.
  • APLpy (the Astronomical Plotting Library in Python) is a Python module aimed at producing publication-quality plots of astronomical imaging data in FITS format (this introduction is copied from the APLpy site).
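Here is a hedged sketch of reading data with two of the modules above, as I understand their interfaces from that era (the file names are hypothetical):

    import pyfits     # FITS I/O; later absorbed into astropy.io.fits
    import idlsave    # reads IDL .save files without IDL

    # a FITS image: the primary HDU carries the pixel array and header
    hdulist = pyfits.open("image.fits")
    data = hdulist[0].data
    header = hdulist[0].header
    hdulist.close()

    # an IDL session file: saved variables come back as attributes
    s = idlsave.read("session.save")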

Statistics, Mathematics, or data science
Given RPy, introducing smaller modules may not seem worthwhile, but quite a few statistics modules and libraries that do not rely on R are available.

  • MDP (Modular toolkit for Data Processing)
    Multivariate data analysis methods like PCA, ICA, FA, etc. have become very popular in the astronomical community.
  • pywavelets (beyond the Fourier transform, various transforms are often used, and the wavelet transform ranks near the top).
  • PyIMSL (see my post, PyIMSL)
  • PyMC: I introduced this module ages ago. It may lack versatility or robustness because of its parametric distribution objects, but I liked the tutorial very much; from it one can expand and devise one’s own working MCMC algorithm (see the sketch after this list).
  • PyBUGS (I introduced this Python wrapper in the BUGS post, but the link to PyBUGS is not working anymore. I hope it revives.)
  • SAGE (Software for Algebra and Geometry Experimentation) is a free open-source mathematics software system licensed under the GPL (Link to the online tutorial).
  • python_statlib: descriptive statistics for the Python programming language.
  • PYSTAT: a nice website, but the product is not available yet. Be aware: it is not PhyStat!
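Since PyMC’s tutorial was the part I liked most, here is a minimal sketch in the PyMC 2.x idiom as I recall it (the model and data are made up; check the tutorial for the authoritative syntax):

    import numpy as np
    import pymc

    data = np.random.normal(2.0, 1.0, 100)    # fake observations

    # priors on the mean and precision, and a Gaussian likelihood
    mu = pymc.Normal("mu", mu=0.0, tau=1e-4)
    tau = pymc.Gamma("tau", alpha=0.01, beta=0.01)
    obs = pymc.Normal("obs", mu=mu, tau=tau, value=data, observed=True)

    M = pymc.MCMC([mu, tau, obs])
    M.sample(iter=10000, burn=2000)
    print(M.trace("mu")[:].mean())            # posterior mean of mu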

Module for AstroStatistics
import inference (unfortunately, the links to the examples and tutorial are currently unavailable)

Without clear objectives, it is not easy to pick up a new language. If you are used to working with one from the alphabet soup, you will most likely stick with your choice. Switching languages only happens when your instructor specifically asks you to use their preferred language, or when the analysis modules, libraries, and tools you need are only available in that language. Somehow, thanks to its object-oriented style, Python makes the transition and communication easier than other languages do. Furthermore, scripting languages are more intuitive and more readily interpretable.

[ArXiv] classifying spectra (hlee, Fri, 23 Oct 2009)
http://hea-www.harvard.edu/AstroStat/slog/2009/arxiv-classifying-spectra/

[arXiv:stat.ME:0910.2585]
Variable Selection and Updating In Model-Based Discriminant Analysis for High Dimensional Data with Food Authenticity Applications
by Murphy, Dean, and Raftery

Classifying or clustering (or semi-supervised learning of) spectra is a very challenging problem, from collecting statistical-analysis-ready data to reducing the dimensionality without sacrificing the complex information in each spectrum. It is challenging not only to estimate spiky (non-differentiable) curves via statistically well-defined estimating-equation procedures, but also to transform the data so that they match the regularity conditions assumed in statistics.

Another reason that classifying and clustering astrophysical spectroscopic data is more difficult is that the observed lines, and their intensities and FWHMs on top of the continuum, are tied to atomic databases and to latent variables/hyperparameters (distance, rotation, absorption, column density, temperature, metallicity, types, system properties, etc.). Frequently it becomes a very challenging mixture problem to separate lines from one another and from the continuum (boundary and identifiability issues). This complexity appears only in astronomical spectroscopic data, because we only get indirect or uncontrolled data ruled by physics, as opposed to the meat-species spectra in the paper. Spectroscopic data outside astronomy are rather smooth, are observed over a controlled wavelength range, and carry no worries about correcting for recession/radial velocity/redshift/extinction/lensing/etc.

Although the part most relevant to astronomers, i.e., spectroscopic data processing, is not discussed in this paper, the most important part, the application of statistical learning to complex curves (spectral data), is well described. Some astronomers with appropriate data might like to try the variable selection strategy and check out the classification methods from statistics. If it works out, it might save space for storing spectral data and time for collecting high resolution spectra. Please keep in mind that it is not necessary to use the same variable selection strategy. Astronomers can create better working versions for classification and clustering purposes, like hardness ratios, which are often used to reduce the dimensionality of spectral data since low-total-count spectra are not informative over the full energy (wavelength) range. Curse of dimensionality!
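For readers unfamiliar with hardness ratios, here is a minimal sketch of the idea (the band counts are invented, and the band definitions vary by instrument and author):

    import numpy as np

    # fake photon counts in a soft and a hard energy band, one pair per source
    soft = np.array([120, 45, 300, 12], dtype=float)
    hard = np.array([80, 60, 150, 30], dtype=float)

    # one summary number per spectrum instead of many channels; bounded in [-1, 1]
    hr = (hard - soft) / (hard + soft)
    print(hr)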

SINGS (hlee, Wed, 07 Oct 2009)
http://hea-www.harvard.edu/AstroStat/slog/2009/sings/

From SINGS (Spitzer Infrared Nearby Galaxies Survey): Isn’t it a beautiful Hubble tuning fork?

As a first-year graduate student in statistics, because of the rumor that Prof. C. R. Rao wouldn’t teach much longer and because of his fame as the most famous statistician alive, I enrolled in his “multivariate analysis” class without thinking much. Everything was smooth and easy for him, and he had an incredible memory for equations and proofs. However, I only grasped intuitive concepts, like why a method works, not the details of the mathematics, theorems, and proofs. Instantly, I began to think about how the methods could be applied to astronomical data. After a few lessons, I desperately wanted to try out multivariate analysis methods to classify galactic morphology.

The dream died shortly afterward, because there was no data set that could be properly fed into statistical classification methods. I spent quite some time searching astronomical databases, including ADS. This was before SDSS or VizieR became as popular as they are now. Then I thought about applying the methods to classify supernovae, because understanding the patterns of their light curves tells us a lot about the history of our universe (Type Ia SNe are standard candles), and because I knew of some publicly available SN light curves. Immediately, I realized that individual light curves are biased from a sampling perspective, and I did not know how to correct them for multivariate analysis. I also thought about applying multivariate analysis methods to stellar spectral types and to stars in different mechanical systems (single, binary, association, etc.). I thought about how to apply the newly learned methods to every astronomical object I learned about, from sunspots to AGNs.

Regardless of the target objects to be scrutinized under this fascinating subject, “multivariate analysis,” two factors kept discouraging me. One was that I didn’t have enough training to develop, in a couple of weeks, new statistical models reflecting the unique statistical challenges embedded in the data: missing values, irregularities, non-iid structure, outliers, and other features that are hardly transcribed into a statistical setting. The other, which was more critical, was that there was no accessible astronomical database repository for statistical learning. Without deep knowledge of astronomy and trained skills for handling astronomical data, catalogs are generally useless. The catalogs and data sets in archives are different from the data sets in machine learning repositories (those data sets are intuitive).

Astronomers would think analyzing toy/mock data sets is not scientific, because it does not lead to the new discoveries they are always after. From a data analyst’s viewpoint, scientific advances mean finding tools that summarize data in an optimal manner. As I demanded in Astroinformatics, methods for retrieving information can be attempted and validated on well-understood astrophysical data sets. The Pythagorean theorem was not proved only once; there are 39 different ways to prove it.

Seeing this nice poster image (the full-resolution image, 56MB, is available from the link) brought back memories of my enthusiasm for applying statistical learning methods for better knowledge discovery. As you can see, there are so many different types of galaxies, and oftentimes there is no clear boundary between them; consider classifying blurry galaxies by eye: a spiral can be classified as an irregular, for example. Although I wish for automatic classification of these astrophysical objects, because of the difficulty of composing a training set for classification, or of collecting data from distinct manifold groups for clustering, machine learning procedures are as complicated to develop as the complexity this tuning fork shows. The complex topology of astronomical objects seems to be the primary reason statistical learning applications are lacking here compared to other fields.

Nonetheless, multivariate analysis can be useful for viewing relations from different perspectives, apart from known physics models. It may help to develop more finely tuned physics models by taking into account latent variables found through statistical learning. Such attempts, I believe, can assist astronomers in designing telescopes and in inventing efficient ways to collect and analyze data, by revealing which features are more significant than others for understanding the morphological shapes of galaxies, patterns in light curves, spectral types, etc. When such experience accumulates, different physical insights can kick in, much as scientists once scrambled and assembled galaxies into a tuning fork that led to the development of various evolution models.

To make a long story short, you have two choices: one, just enjoy these beautiful pictures and apprehend the complexity of our universe; or two, let this picture of Hubble’s tuning fork inspire you toward advances in astroinformatics. Whichever path you choose, it is worth your time.

[MADS] Kalman Filter (hlee, Fri, 02 Oct 2009)
http://hea-www.harvard.edu/AstroStat/slog/2009/mads-kalman-filter/

I decided a while ago to discuss the Kalman filter on the slog, after finding out that this popular methodology is rather underrepresented in astronomy. It is not completely missing from ADS, however: full-text and all-bibliographic-source searches show more results. Their use of the Kalman filter, though, looked similar to the generic use of “genetic algorithms” or “Bayes theorem.” Probably the broad notion of the Kalman filter makes it difficult to find applications by that name in astronomy, since wheels are often reinvented (algorithms under different names share the same objective).

When I learned about the “Kalman filter” for the first time, I was not sure how to distinguish it from the “Yule-Walker equations” (time series), the “Padé approximant” (unfortunately, the wiki page does not give its matrix form), the “Wiener filter” (signal processing), etc. Here are the publications found on ADS that specifically mention the name Kalman filter in their abstracts.

My motivation for introducing the Kalman filter, even though it is a very well known term, is the recent Fisher Lecture given by Noel Cressie at JSM 2009. He is a leading expert in spatial statistics and the author of a very famous book on the subject. During his presentation, he described challenges from satellite data and how the Kalman filter accelerated computing a gigantic covariance matrix in kriging. Satellite data in meteorology and the geosciences may not exactly match astronomical satellite data, but from a statistical modeling perspective the challenges are similar: massive data, streaming data, multiple dimensions, temporal structure, missing observations in certain areas, different exposure times, estimation and prediction, interpolation and extrapolation, large image sizes, and so on. It’s not just about denoising/cleaning images. Statisticians want to find the driving force behind certain features by modeling, and to perform statistical inference. (They do not mind parametrizing the interesting metric/measure/quantity for modeling, or they approach the problem in a nonparametric fashion.) I understood the use of the Kalman filter as a fast solution to inverse problems for inference.
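No worked example survives in the post, so here is a minimal one-dimensional Kalman filter sketch under a random-walk state model (all noise levels are invented):

    import numpy as np

    np.random.seed(0)
    q, r = 0.01, 1.0                          # process and measurement noise variances
    x_true = np.cumsum(np.sqrt(q) * np.random.randn(200))   # latent random walk
    y = x_true + np.sqrt(r) * np.random.randn(200)          # noisy observations

    xhat, p = 0.0, 1.0                        # initial state estimate and its variance
    xhats = []
    for obs in y:
        p = p + q                             # predict: variance grows by q
        k = p / (p + r)                       # Kalman gain
        xhat = xhat + k * (obs - xhat)        # update: blend prediction and datum
        p = (1.0 - k) * p
        xhats.append(xhat)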

data analysis system and its documentation (hlee, Fri, 02 Oct 2009)
http://hea-www.harvard.edu/AstroStat/slog/2009/data-analysis-system-and-its-documentation/

So far, I haven’t complained much about my “statistician learning astronomy” experience; instead, I’ve been trying to emphasize how fascinating it is, and I hope more statisticians will join this adventure now that statisticians’ insights are in demand more than ever. However, this positivity does not seem to be working so far. In the two years of this slog’s life, there has been no posting by a statistician except one about BEHR. Statisticians are busy and easily distracted by other fields with more tangible data sets. Or perhaps, compared to other fields, too many obstacles and too-high barriers exist in astronomy for statisticians to participate. I’d like to talk about these challenges from my end.[1]

The biggest challenge for a statistician using astronomical data is the lack of mercy toward nonspecialists accessing the data, including its format, quantification, and qualification[2], and the data analysis systems. IDL is costly, although it is used in many disciplines, and other tools in astronomy are hardly reusable across different projects.[3] In that regard, I welcome astronomers using Python to break such exclusiveness in astronomical data analysis systems.

Even if the data and software issues are resolved, there is another barrier to climb: validation. If you have a catalog, you’ll see measured variables and their errors, typically reflecting the size of the PSF and its convolution into those metrics. When a Gaussian assumption is applied in order to tabulate a power-law index, King’s, Petrosian’s, or de Vaucouleurs’ profile index, and numerous other metrics, I often fail to find any validation of the Gaussian assumptions, the Gaussian residuals, the spectral and profile models, the outliers, or the optimal binning. Even if a data set is publicly available, I also fail to find how to read in the raw data, what factors must be considered, and what can be discarded because of unexpected contamination such as cosmic rays and charge overflows. How would I validate that raw data read into a data analysis system are correctly processed to match the values in the catalogs? How would I know all the entries in a catalog are ready for further scientific data analysis? Are those sources real? Are the p-values appropriately computed?

I posted an article about Chernoff faces applied to Capella observations from Chandra. Astronomers had already processed the raw data and published a catalog of X-ray spectra, so I believed the information in the catalog was validated and ready for scientific data analysis. I heard that the repeated Capella observations are for calibration. Generally speaking, in other fields, calibration targets are almost time-invariant and exhibit consistency. If Capella is the same star over those 10 years, the faces in my post should look almost the same, within measurement error; but as you saw, they were not consistent at all. The faces look as if the observations were made toward different objects. So far I have failed to find any validation effort explaining why certain ObsIDs of Capella look different from the rest. Are they really Capella? Can I use these inconsistent facial expressions as evidence that the Chandra calibration at that time was inappropriate? Or can I conclude that Capella was a wrong choice for calibration?

Because of the lack of description of the quantification procedure from raw data to catalog, I decided to access the raw data and do the processing on my own, to cross-check the validity of the catalog entries. The benefit of this effort is that I can easily manipulate the data for further statistical inference. Although reading and processing raw data may sound easy, I came across another problem: the lack of documentation enabling nonspecialists to perform the task.

A while ago, I talked about read.table() in R. There are slightly different commands and options, but without much hurdle one can easily read ASCII data in various styles with read.table() for exploratory and confirmatory data analysis in R. From my understanding, statisticians do not spend much time reading in data, nor on collecting them; we are interested in methodology for extracting information about the population based on the sample. While the focus is methodology, all the frustrations with astronomical data analysis software occur before one can even investigate the best method. My level of frustration reached the point of extinguishing my eagerness to investigate inference tools further.

In order to assess those Capella observations, I invoked ciao, thanks to its on-site help. Beforehand, I’d like to clarify that I use ciao as an example to illustrate the culture difference I experienced as a statistician, and to discuss why I think astronomical data analysis systems are short on documentation and why astronomical data processing procedures lack validation. I must say that I confronted very similar problems when I tried to learn astronomical packages such as IRAF and AIPS; ciao simply happened to be at hand when writing this post.

In order to understand X-ray data, one needs not only the image data files but also the effective area (arf), the redistribution matrix (rmf), and the point spread function (psf). These files are called calibration data files. If the package had been developed for general users, I would expect, as with read.table(), a homogenized/centralized reading function with options that covers data including calibration data. Instead, there were various kinds of functions one can use to read in data, but the descriptions were not enough to know which one does what. What is the functionality of these commands? Which one only stores the names of the data files? Which one reconfigures the raw data to reflect the up-to-date calibration files? Not knowing the complete data structures and classes within ciao, and not getting the exact functionality of these data-reading functions from ahelp, I was not sure whether the log-likelihood I computed was appropriate or not.

For example, ciao offers five different ways to associate an arf: read_arf(), load_arf(), set_arf(), get_arf(), and unpack_arf(). Except for unpack_arf(), I couldn’t understand the differences among these functions for accessing an arf.[4] Other software I use, including XSPEC, generally has a single function with options to execute different levels of reading in data. Ciao has extensive web documentation but no tutorial (see my post), so I read all the ahelp “commands” a few times. But I still couldn’t decide which ones to use to read in my arfs and rmfs (I happened to have many calibration data files).

        arf         rmf         psf         pha         data
get     get_arf     get_rmf     get_psf     get_pha     get_data
set     set_arf     set_rmf     set_psf     set_pha     set_data
unpack  unpack_arf  unpack_rmf  unpack_psf  unpack_pha  unpack_data
load    load_arf    load_rmf    load_psf    load_pha    load_data
read    read_arf    read_rmf    read_psf    read_pha    read_data

[Note that the above links may not work, since the ciao documentation website evolves quickly. Some might be routed to different pages, so please check this website for other data reading commands: cxc.harvard.edu/sherpa/ahelp/index_alphabet.html]
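For concreteness, here is a hedged sketch of one plausible load_* path in Sherpa’s Python interface; the commands are real ahelp entries, but whether this is the recommended combination is exactly the kind of question this post is about (file names are hypothetical, and the XSPEC model component requires XSPEC support):

    from sherpa.astro.ui import *             # Sherpa's high-level session interface

    load_pha("src.pha")                       # read the spectrum into the default dataset
    load_arf("src.arf")                       # attach the effective area
    load_rmf("src.rmf")                       # attach the redistribution matrix
    set_source(xsphabs.abs1 * powlaw1d.p1)    # absorbed power-law model
    fit()                                     # fit with the current statistic and optimizer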

So, several months back, I decided to seek help through the CXC help desk. Their answers are very reliable and prompt. My question was, “What are the differences among read_xxx(), load_xxx(), set_xxx(), get_xxx(), and unpack_xxx(), where xxx can be data, arf, rmf, or psf?” The answer to this question was:

You can find detailed explanations for these Sherpa commands in the “ahelp” pages of the Sherpa website:

http://cxc.harvard.edu/sherpa/ahelp/index_alphabet.html

This is a good answer, but a big culture shock to a statistician. It’s like telling IDL users to “check http://www.r-project.org/search.html and http://cran.r-project.org/doc/FAQ/R-FAQ.html” to find out the difference between read.table() and scan(). Probably, for astronomers, all the various data-reading commands above are self-explanatory, the way read.table(), read.csv(), and scan() are in R. Disappointingly, this answer was not what I was looking for.

Well, thanks to this bewilderment, hesitation, and some skepticism, I couldn’t move on to the next step of implementing fitting methods. At the beginning, I was optimistic when I found out that ciao 4.0 and up is Python compatible. I thought I could do things in statistically more rigorous ways, since I could fake spectra to validate my fitting methods. I was thinking about modifying the indispensable chi-square method, which is used twice, for point estimation and for hypothesis testing, and thereby introduces bias (a link was made to a posting). My goal was to make it less biased and more robust, less sensitive to the iid Gaussian residual assumptions. Against my high expectations, I became frustrated at the first step: reading and playing with the data to get a better sense of them and to develop a quick intuition. I couldn’t even take a baby step toward my goal. I’m not sure whether it is a good thing or not, but I haven’t been completely discouraged. Also, time gradually helps to overcome this culture difference, the lack of documentation.

What happens in general is that if a predecessor says “use set_arf(),” then the apprentice uses set_arf() without doubt. If you begin learning on your own, relying purely on the documentation, I guess at some point you have to make a choice. One can make a lucky guess and move forward quickly. Sometimes one lands in a miserable situation, because one is not sure about the choice and cannot trust the features that appear after the processing. I guess it is natural to be curious about what each of these commands is doing to your data and what information is carried over to the next command in the analysis procedure. It seems right to want to know which command is best for the particular data processing and statistical inference at hand. What I found is that such comparisons across commands are missing from the documentation. This is why I thought astronomical data analysis systems are short of mercy for nonspecialists.

Another thing I observed is that there seems to be no documentation or standard procedure for creating repeatable data analysis results. My observation of astronomers suggests that, with the same raw data, the results of scientist A and scientist B differ (even beyond statistical margins). There are experts with the knowledge to explain why results differ on the same raw data, but not everyone has the luxury of consulting those few experts. I cannot understand such exclusiveness in place of standardizing the procedures through validation. I have even seen that the data A analyzed some years back can differ from this year’s version when he or she writes a new proposal. The time spent recreating the data processing and inference procedures to explain/justify/validate the differing results, or to close or narrow the gap, would not have been wasted if there were standard procedures and documentation for them. This is purely a statistician’s thought; as the comment in “where is ciao X?”[5] notes, not every data analysis system has to have similar designs and goals.

Getting lost while figuring out the basics (handling arf, rmf, psf, and numerous case-by-case corrections) before applying any simple statistics has been my biggest obstacle in learning astronomy. The lack of documented validation methods often frustrates me. I wonder if there are astronomers who got similarly lost learning statistics via R, Minitab, SAS, MATLAB, Python, etc. As discussed in “where is ciao X?”, I wish there were a centralized tutorial offering the basics with references: how to read in data; how to manipulate data vectors and matrices; how to do arithmetic and error propagation adequately without violating statistical assumptions (I don’t like the fact that a point estimate of the background level is subtracted from the observed counts, a random variable, when the distribution does not permit such a shift); how to handle and manipulate FITS files from Chandra for various statistical analyses; how to do basic image analysis; how to do basic spectral analysis; and so on.[6]
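To illustrate the parenthetical complaint about background subtraction, here is a tiny sketch (all rates invented): subtracting a point estimate of the background from Poisson counts can yield negative “counts,” which no Poisson model permits.

    import numpy as np

    np.random.seed(1)
    src_rate, bkg_rate = 0.5, 2.0                         # faint source, strong background
    total = np.random.poisson(src_rate + bkg_rate, 10)    # observed counts
    bkg_est = 2.1                                         # point estimate from off-source data

    naive = total - bkg_est                               # subtraction in measurement space
    print(naive)                                          # some entries go negative

    # a Poisson-respecting alternative treats the source rate as a parameter
    # and infers it from the likelihood of total given (src_rate + bkg_rate),
    # as done, e.g., in BEHR, rather than shifting the data themselves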

  1. This is quite an overdue posting. Links and associated content can be outdated.
  2. For classification purposes, data with a clear distinction between response and predictor variables, a so-called training data set, must be given. However, I often fail to get processed data sets for statistical analysis. I first spend time reading the data and asking what is an outlier, a bias, or garbage. I’m not sure how to clean and extract numbers for statistical analysis, and every subfield of astronomy has its own way of cleaning data to be fed into statistics and scatter plots. For example, image processing is still executed case by case via the trained eyes of astronomers. In medical imaging, on the other hand, diagnostic specialists offer training sets with which computer vision scientists develop classification algorithms. Such collaboration yields accelerated, automatic, albeit preliminary, diagnostic tools. A small fraction of the results from these preliminary methods can still be ambiguous, i.e., false positives or false negatives. Yet when such ambiguous cancerous-cell images at the decision boundaries occur, specialists, like trained astronomers, scrutinize those images to make a final decision. As medical imaging and its classification algorithms resolve the issue of expert shortage under an overflow of images, I wish astronomers would adopt such strategies to confront massive streaming images and to assist the sparse ranks of trained astronomers.
  3. Something I would like to see is background being handled statistically in high energy astrophysics. When simulating a source, the background can be simulated as well, via Markov random fields, kriging, and other spatial statistics methods. In reality, the background is subtracted once in measurement space, and its random nature is not interactively reflected. Regardless of the statistical methodology available to reflect the nature of the background, it is difficult to implement for trial and validation, because those tools are not amenable to adding statistical modules and packages.
  4. A Sherpa expert told me there is an FAQ on this matter (which I had previously failed to locate). However, from a data analysis perspective, like the distinction between data frames, vectors, matrices, lists, and other data types in R, the description is not sufficient for someone who wants to learn ciao and to perform scientific (deterministic or stochastic) data analysis via scripting, i.e., handling objects appropriately. You might want to read the comparison of commands in the Sherpa FAQ.
  5. I know there is ciaox. Apart from the space between ciao and X, there is another difference that astronomers care less about than statisticians do: the difference between X and x. Typically, the capital letter is for a random variable and the lower-case letter for an observation or value.
  6. By the way, there are ciao workshop materials available that could function as tutorials. Please, locate them if needed.
More on Space Weather (hlee, Tue, 22 Sep 2009)
http://hea-www.harvard.edu/AstroStat/slog/2009/more-on-space-weather/

Thanks to a Korean solar physicist,[1] I was able to gather the following websites and some relevant information on space weather forecasting in action, not limited to the literature or toy data.


These seem quite informative, and I believe more statisticians and data scientists (in signal and image processing, machine learning, computer vision, and data mining) could easily collaborate with solar physicists. All the complexity, as a matter of fact, comes from processing the data to be fed into (machine, statistical) learning algorithms and from defining the objectives of the learning. Once those are settled, one can easily apply numerous methods from the field to these time-varying solar images.

I’m writing this short posting because I finally found those interesting articles that I had collected for my previous post on Space Weather. After finding them and scanning through them, I realized that, methodology-wise, they have only taken baby steps. You’ll see a limited number of keywords repeated, although there is a humongous community of scientists and engineers in knowledge discovery and data mining.

Note that the objectives of these studies are quite similar. They describe machine learning for the purpose of automating the detection of features of interest on the Sun, and possibly forecasting the relevant phenomena that affect our own atmosphere through the associated solar activities.

  1. Automated Prediction of CMEs Using Machine Learning of CME-Flare Associations by Qahwaji et al. (2008) in Solar Phys., vol. 248, pp. 471-483.
  2. Automatic Short-Term Solar Flare Prediction Using Machine Learning and Sunspot Associations by Qahwaji and Colak (2007) in Solar Phys., vol. 241, pp. 195-211.

    Space weather is defined by the U.S. National Space Weather Program (NSWP) as “conditions on the Sun and in the solar wind, magnetosphere, ionosphere, and thermosphere that can influence the performance and reliability of space-borne and ground-based technological systems and can endanger human life or health.”

    Personally, I think the “jackknife” section should be replaced with “cross-validation.”

  3. Automatic Detection and Classification of Coronal Mass Ejections by Qu et al. (2006) in Solar Phys., vol. 237, pp. 419-431.
  4. Automatic Solar Filament Detection Using Image Processing Techniques by Qu et al. (2005) in Solar Phys., vol. 228, pp. 119-135.
  5. Automatic Solar Flare Tracking Using Image-Processing Techniques by Qu et al. (2004) in Solar Phys., vol. 222, pp. 137-149.
  6. Automatic Solar Flare Detection Using MLP, RBF, and SVM by Qu et al. (2003) in Solar Phys., vol. 217, pp. 157-172.

I’d like to add a survey paper on other types of learning methods beyond the Support Vector Machine (SVM) used in almost all the articles above. Luckily, this survey paper happens to address my concern about the “practice of background subtraction” in high energy astrophysics.

A Survey of Manifold-Based Learning Methods by Huo, Ni, and Smith
[Excerpt] What is Manifold-Based Learning?
It is an emerging and promising approach to nonparametric dimension reduction. The article reviews principal component analysis, multidimensional scaling (MDS), generative topographic mapping (GTM), locally linear embedding (LLE), ISOMAP, Laplacian eigenmaps, Hessian eigenmaps, and local tangent space alignment (LTSA). Apart from these revisits and comparisons, this survey paper is useful for understanding the danger of background subtraction: homogeneity does not mean there is a constant background to subtract, and subtracting one often causes negative source observations.
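The survey contains no code; as a rough sketch of what two of the reviewed methods do, using scikit-learn (which postdates most of the cited work) and made-up data:

    import numpy as np
    from sklearn.manifold import Isomap, LocallyLinearEmbedding

    # fake high-dimensional data lying near a curved one-dimensional manifold
    t = np.linspace(0, 3 * np.pi, 400)
    X = np.column_stack([np.cos(t), np.sin(t), 0.1 * t])
    X += 0.01 * np.random.randn(*X.shape)

    # nonlinear dimension reduction down to one coordinate along the manifold
    y_lle = LocallyLinearEmbedding(n_neighbors=10, n_components=1).fit_transform(X)
    y_iso = Isomap(n_neighbors=10, n_components=1).fit_transform(X)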

More collaboration among multiple disciplines is desired in this relatively new field. For me, it is one of the best data and information science fields of the 21st century, and any progress will be beneficial to humankind.

  1. I must acknowledge him for his kindness and patience. He was my Wikipedia for questions while I was studying the Sun.
[MADS] compressed sensing (hlee, Fri, 11 Sep 2009)
http://hea-www.harvard.edu/AstroStat/slog/2009/mads-compressed-sensing/

Soon it will no longer qualify for [MADS], because I have seen some abstracts on arxiv.org with the phrase “compressed sensing.” Nonetheless, so far there is only one refereed publication in ADS.

http://adsabs.harvard.edu/abs/2009MNRAS.395.1733W.
Title: Compressed sensing imaging techniques for radio interferometry
Authors: Wiaux, Y. et al.
Abstract: Radio interferometry probes astrophysical signals through incomplete and noisy Fourier measurements. The theory of compressed sensing demonstrates that such measurements may actually suffice for accurate reconstruction of sparse or compressible signals. We propose new generic imaging techniques based on convex optimization for global minimization problems defined in this context. The versatility of the framework notably allows introduction of specific prior information on the signals, which offers the possibility of significant improvements of reconstruction relative to the standard local matching pursuit algorithm CLEAN used in radio astronomy. We illustrate the potential of the approach by studying reconstruction performances on simulations of two different kinds of signals observed with very generic interferometric configurations. The first kind is an intensity field of compact astrophysical objects. The second kind is the imprint of cosmic strings in the temperature field of the cosmic microwave background radiation, of particular interest for cosmology.

As discussed, reconstructing images from noisy observations is typically considered an ill-posed or inverse problem. Owing to my personal lack of comprehension of image reconstruction for radio interferometric observations, based on samples of Fourier space inverted via the inverse Fourier transform, I cannot judge how good this new adaptation of compressed sensing for radio astronomical imagery is. I think, however, that compressed sensing will take over many traditional image reconstruction tools, owing to their shortcomings in handling sparsely represented large data sets/images.

Please check my old post on compressed sensing for more references on the subject, like the Rice University repository, in addition to the references in Wiaux et al. It is a new, exciting field with countless applications, already enjoying wide popularity in many scientific and engineering fields. My thought is that well-developed compressed sensing algorithms might resolve bandwidth issues in satellite observations/communication by transmitting more images within a fraction of the time, for improved image reconstruction.
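To give a flavor of sparse reconstruction (this is a generic iterative soft-thresholding sketch with invented sizes, not the algorithm of Wiaux et al.):

    import numpy as np

    np.random.seed(0)
    n, m, k = 200, 80, 5                      # signal length, measurements, sparsity
    x = np.zeros(n)
    x[np.random.choice(n, k, replace=False)] = np.random.randn(k)

    A = np.random.randn(m, n) / np.sqrt(m)    # random sensing matrix
    y = A.dot(x)                              # incomplete linear measurements

    # ISTA: gradient step on ||y - Ax||^2 followed by soft thresholding (L1 prior)
    lam = 0.01
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    xhat = np.zeros(n)
    for _ in range(500):
        z = xhat + step * A.T.dot(y - A.dot(xhat))
        xhat = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)

    print(np.linalg.norm(xhat - x) / np.linalg.norm(x))   # relative recovery error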

[ArXiv] component separation methods (hlee, Tue, 08 Sep 2009)
http://hea-www.harvard.edu/AstroStat/slog/2009/arxiv-component-separation-methods/

I happened to observe a surge of principal component analysis (PCA) and independent component analysis (ICA) applications in astronomy. PCA and ICA are used to separate mixed components under certain assumptions. For PCA, the decomposition rests on the assumption that the original sources are orthogonal (uncorrelated) and that the mixed observations are approximated by a multivariate normal distribution. For ICA, the assumption is that the sources are independent and non-Gaussian (though it allows one source component to be Gaussian). Such assumptions allow one to set up dissimilarity measures, and the algorithms work toward maximizing them.
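As a toy illustration of the contrast (a sketch with invented sources, using scikit-learn rather than any package from the paper):

    import numpy as np
    from sklearn.decomposition import PCA, FastICA

    # two made-up source signals, mixed linearly into two observed channels
    t = np.linspace(0, 8, 2000)
    S = np.column_stack([np.sin(7 * t),               # smooth periodic source
                         np.sign(np.sin(3 * t))])     # spiky non-Gaussian source
    A = np.array([[1.0, 0.5], [0.4, 1.0]])            # mixing matrix
    X = S.dot(A.T)                                    # observed mixtures

    # PCA finds uncorrelated (orthogonal) directions of maximal variance
    X_pca = PCA(n_components=2).fit_transform(X)
    # ICA exploits non-Gaussianity to recover the independent sources
    X_ica = FastICA(n_components=2, random_state=0).fit_transform(X)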

The need for source separation methods in astronomy has led to various adaptations of the available decomposition methods. It is not difficult to locate such applications in the journals of various fields, including astronomical journals. However, they most likely promote one dimension reduction method of the authors’ choice over others, to emphasize that their strategy works better. I rarely come across a paper that gathers and summarizes the component separation methods applicable to astronomical data. In that regard, the following paper seems a useful overview of dimension reduction methods for astronomers.

[arxiv:0805.0269]
Component separation methods for the Planck mission
by S. M. Leach et al.
Check its appendix for the method descriptions.

Various libraries/modules are available through software/data analysis systems, so one can try various dimension reduction methods conveniently. The only concern I have is the challenge of interpretation after these computational/mathematical/statistical analyses: how to assign a physical interpretation to the images/spectra produced by the decomposition. I think this is a big open question.

[ArXiv] Statistical Analysis of fMRI Data (hlee, Wed, 02 Sep 2009)
http://hea-www.harvard.edu/AstroStat/slog/2009/arxiv-statistical-analysis-of-fmri/

[arxiv:0906.3662] The Statistical Analysis of fMRI Data by Martin A. Lindquist
Statistical Science, Vol. 23(4), pp. 439-464

This review paper offers information and guidance on statistical image analysis for fMRI data that can be extended to astronomical image data. I think fMRI data present challenges similar to those of astronomical images. As Lindquist says, collaboration helps in finding shortcuts. I hope that introducing this paper fosters further networking and collaboration between statisticians and astronomers.

List of similarities

  • data acquisition: data are read in the frequency domain and images are reconstructed via the inverse Fourier transform (to my naive eyes, this looks similar to power spectrum analysis of cosmic microwave background (CMB) data); see the sketch after this list.
  • amplitudes or coefficients are what get analyzed and cared about, not phases or wavelets.
  • understanding the data: brain physiology, or physics such as cosmological models, describes the data generating mechanism.
  • limits in/trade-off between spatial and temporal resolution.
  • understanding/modeling noise and signal.
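As a toy version of the frequency-domain acquisition in the first bullet (nothing here is from the paper; the image and the sampling mask are invented):

    import numpy as np

    # a simple square "source" in a 64x64 image
    img = np.zeros((64, 64))
    img[24:40, 24:40] = 1.0

    kspace = np.fft.fft2(img)                  # full frequency-domain data
    mask = np.random.rand(64, 64) < 0.5        # keep only half of the samples
    recon = np.fft.ifft2(kspace * mask).real   # reconstruction from incomplete data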

These similarities seem common to the statistical analysis of images from fMRI and from telescopes. Notwithstanding, no astronomer can (or wants to) carry out experimental design, which can be a huge difference between medical and astronomical image analysis. My emphasis is that, because of these commonalities, strategies in preprocessing and data analysis for fMRI data can be shared with astronomical observations, and vice versa. Some sloggers will want to check Section 6, which covers various statistical models and methods for spatial and temporal data.

I’ll simply end this posting with the following quotes, which say that statisticians play a critical role in scientific image analysis. :)

There are several common objectives in the analysis of fMRI data. These include localizing regions of the brain activated by a task, determining distributed networks that correspond to brain function and making predictions about psychological or disease states. Each of these objectives can be approached through the application of suitable statistical methods, and statisticians play an important role in the interdisciplinary teams that have been assembled to tackle these problems. This role can range from determining the appropriate statistical method to apply to a data set, to the development of unique statistical methods geared specifically toward the analysis of fMRI data. With the advent of more sophisticated experimental designs and imaging techniques, the role of statisticians promises to only increase in the future.

A full spatiotemporal model of the data is generally not considered feasible and a number of short cuts are taken throughout the course of the analysis. Statisticians play an important role in determining which short cuts are appropriate in the various stages of the analysis, and determining their effects on the validity and power of the statistical analysis.

]]>
http://hea-www.harvard.edu/AstroStat/slog/2009/arxiv-statistical-analysis-of-fmri/feed/ 0