Perhaps some confusion has arisen from my choice of a poor title for this thread. Actually, the most important part of the new rejection algorithm is the linear fit, not RANSAC. The use of the RANSAC algorithm is an implementation detail; the core of the method is that we fit all pixels in a stack to a straight line. In fact, I am considering the possibility of not using RANSAC at all, in favor of other, more efficient robust algorithms.
The difference between linear fit clipping and sigma clipping is best explained with a graphical example. Consider the following figure.
(http://forum-images.pixinsight.com/II/sigma-clipping.png)
Suppose we are integrating N images. In the figure, we have represented a stack of N pixels at the same coordinates, whose values have been sorted in ascending order and plotted as circles on the graphic. The horizontal axis is the ordered sample number, and the vertical axis represents the available range of pixel values.
What we pursue with pixel rejection is to exclude pixels with spurious values, such as pixels belonging to cosmic ray hits, plane and satellite trails, and other abnormally high or low values that are not valid data for some reason. We want to reject those pixels to make sure that they won't enter the integrated image.
The symbol m represents the median of the distribution. The median is the value of the central element in the ordered sequence, and it provides a robust estimate of the central value of the distribution. Here the median is being used as a robust estimate of the true value of the pixel, or more realistically, the most representative value of the pixel that we have available. Robustness here means being unaffected by outliers. An outlier is an abnormally low or high value, such as the two extreme values in the figure. Outliers are precisely the values that we want to exclude, or reject. By contrast, valid pixel values, or inliers, are the ones we want to keep and combine into a single image to achieve a better signal-to-noise ratio. Our goal is to reject only truly spurious pixels. We need a very accurate method, able to distinguish between valid and invalid data in a smart way, adapted to the variety of problems present in our images. Ideally, no valid data should be rejected at all.
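As a quick toy illustration of that robustness (plain NumPy, not part of ImageIntegration; the numbers are made up):

    import numpy as np

    # A stack of 9 normalized pixel values at the same coordinates;
    # the last one is a cosmic ray hit, an obvious outlier.
    stack = np.array([0.101, 0.102, 0.103, 0.104, 0.105,
                      0.106, 0.107, 0.108, 0.950])

    print(np.mean(stack))    # ~0.198, badly pulled toward the outlier
    print(np.median(stack))  # 0.105, essentially unaffected

The mean is dragged toward the cosmic ray, while the median stays at a representative value of the stack.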
With two horizontal lines we have represented the clipping points for high and low pixels, that is, for the pixels above and below the median of the stack, respectively. Pixels that fall outside the range defined by the two clipping points are considered outliers and hence rejected. In the sigma clipping algorithm, as well as in its Winsorized variant, the clipping points are defined by two multipliers in sigma units, namely KH and KL in the figure. Sigma is an estimate of the dispersion of the distribution of pixel values. In the standard sigma clipping algorithms, sigma is taken as the standard deviation of the pixel stack. These algorithms are iterative: the clipping process is repeated until no more pixels can be rejected.
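For reference, a minimal sketch of such an iterative sigma clipping scheme could look like this in Python (this is not the actual ImageIntegration code; normalization and other details are omitted, and the clipping multipliers are arbitrary):

    import numpy as np

    def sigma_clip(stack, k_low=4.0, k_high=2.0):
        # stack: 1-D array with the pixel values at the same coordinates
        # across the N images. Returns a boolean mask of surviving pixels.
        keep = np.ones(stack.size, dtype=bool)
        while True:
            values = stack[keep]
            m = np.median(values)
            sigma = np.std(values)
            # Clip low pixels below m - k_low*sigma and high pixels above m + k_high*sigma.
            new_keep = keep & (stack >= m - k_low * sigma) & (stack <= m + k_high * sigma)
            if new_keep.sum() == keep.sum():   # no more pixels rejected
                return new_keep
            keep = new_keep

Roughly speaking, the Winsorized variant replaces the extreme values with the boundary values when estimating sigma, instead of simply discarding them, which makes the dispersion estimate more stable.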
In the figure, rejected pixels have been represented as empty circles. Six pixels have been rejected, three at each end of the distribution. The question is: are these pixels really outliers? Under normal conditions, the answer is most likely yes. Normal conditions include, among other things, that the N images have flat illumination profiles. Now suppose that some of the images have slight additive sky gradients of different intensities and orientations: for example, one image is slightly brighter toward its upper half, while another is somewhat brighter on its lower half. Due to the varying gradients, the sorted distribution of pixels can show tails of apparent outliers at both ends, which may lead a sigma clipping algorithm to reject valid pixels incorrectly.
Enter linear fit clipping:
(http://forum-images.pixinsight.com/II/linear-fit-clipping.png)
In this figure we have the same set of pixels as before, but the linear fit clipping algorithm is being used instead of sigma clipping. The new algorithm fits a straight line to the set of pixels, trying to find the best possible fit in the sense of minimizing the average deviation, computed as the mean of the distances of all pixel values to the line. The line that minimizes the average deviation captures the actual tendency of the set of pixel values. Note that a linear fit is more robust than a sigma clipping scheme, in the sense that the fitted line does not depend as strongly on all the images having flat illumination profiles. If all the images are flat (and have been normalized, as part of the rejection procedure), the fitted line will tend to be horizontal; if there are illumination variations, its slope will tend to grow. In practice, a linear fit is more tolerant of slight additive gradients than sigma clipping.
The net result is that fewer false outliers are rejected, and hence more pixels enter the final integrated image, improving the SNR. In the figure, r is the fitted line and d represents the average deviation. KL and KH, as before, are two multipliers for low and high pixels, in average deviation units. However, instead of classifying pixels with respect to a fixed value (the median, in the case of sigma clipping), in the linear fit algorithm low and high pixels are those that lie below and above the fitted line, respectively. Note that in the linear fit case only two pixels have been rejected as outliers, while the set of inlier pixels fits a straight line very well.
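A rough sketch of the procedure (again Python/NumPy, and not the actual implementation: here an ordinary least-squares fit is iterated, whereas the real algorithm minimizes the average absolute deviation, e.g. with a robust fitter such as RANSAC; the multipliers are arbitrary):

    import numpy as np

    def linear_fit_clip(stack, k_low=5.0, k_high=2.5):
        # stack: 1-D array with the pixel values at the same coordinates
        # across the N images. Returns the sorted values and a boolean
        # mask of the surviving (inlier) pixels.
        y = np.sort(stack)
        x = np.arange(y.size, dtype=float)
        keep = np.ones(y.size, dtype=bool)
        while keep.sum() > 2:
            # Fit a straight line r(x) = a*x + b to the surviving pixels.
            a, b = np.polyfit(x[keep], y[keep], deg=1)
            residuals = y - (a * x + b)
            # d: average deviation of the surviving pixels from the line.
            d = np.mean(np.abs(residuals[keep]))
            # Reject pixels more than k_low*d below or k_high*d above the line.
            new_keep = keep & (residuals >= -k_low * d) & (residuals <= k_high * d)
            if new_keep.sum() == keep.sum():
                break
            keep = new_keep
        return y, keep

With flat, normalized frames the fitted slope tends toward zero and the result is close to what sigma clipping would give; with mild additive gradients the slope absorbs the tilt of the sorted distribution instead of inflating the dispersion estimate, so fewer valid pixels are clipped.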
The remaining question is: is linear fit clipping the best rejection algorithm? I wouldn't say that. Linear fit rejection requires a relatively large set of images to work optimally. With fewer than 15 images or so, its performance may be similar to, or worse than, Winsorized sigma clipping. To reject pixels around a fitted line with low uncertainty, many points are required to optimize the fit in the minimal average deviation sense. The more images, the better the linear fit, and hence the better the new algorithm will perform. It's a matter of testing it and seeing whether it yields better results than the other available rejection algorithms for each particular case.
Hi Juan,
Enter linear fit clipping:
(http://forum-images.pixinsight.com/II/linear-fit-clipping.png)
...
The remaining question is: is linear fit clipping the best rejection algorithm? I wouldn't say that. Linear fit rejection requires a relatively large set of images to work optimally. With fewer than 15 images or so, its performance may be similar to, or worse than, Winsorized sigma clipping. To reject pixels around a fitted line with low uncertainty, many points are required to optimize the fit in the minimal average deviation sense. The more images, the better the linear fit, and hence the better the new algorithm will perform. It's a matter of testing it and seeing whether it yields better results than the other available rejection algorithms for each particular case.
...
I must admit that this method of identifying outliers and the "true value" of a pixel appeared extremely suspicious to me. After all, who says that a distribution without outliers would be well fitted by a linear function after sorting the samples? So I did an experiment in Excel, generating 50 samples from a Poisson distribution (as a CCD would produce) and adding a polluting signal (a sine) intended to simulate changing sky conditions such as haze. As can be seen in the attached screenshot or Excel 2007 file, your approach comes close to the "true" value of 8000 in both the unpolluted and the polluted case. Quite a surprise for me.
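For anyone without Excel at hand, a rough Python equivalent of the experiment could look like this (the sine "pollution" model and its amplitude are only my guesses at the spreadsheet's setup, not an exact reproduction):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 50
    true_value = 8000.0

    # 50 samples of a Poisson process with mean 8000, as a CCD pixel would deliver.
    clean = rng.poisson(true_value, n).astype(float)

    # Add a slowly varying sine as "pollution", e.g. passing haze.
    polluted = clean + 400.0 * np.sin(np.linspace(0.0, 2.0 * np.pi, n))

    for label, data in (("clean", clean), ("polluted", polluted)):
        y = np.sort(data)
        x = np.arange(n, dtype=float)
        a, b = np.polyfit(x, y, deg=1)     # line fitted to the sorted samples
        centre = a * (n - 1) / 2.0 + b     # value of the fitted line at the central sample
        print(label, "median:", round(np.median(data), 1),
              "fitted-line centre:", round(centre, 1))

Both the median and the centre of the fitted line come out close to 8000 in both cases, in line with the spreadsheet result.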
Is there any literature that describes the method? What are known cases where this method produces bad estimates?
Kind regards,
Georg
Hi,
I did some additional research.
(http://forum-images.pixinsight.com/II/linear-fit-clipping.png)
In essence, this image is a Q-Q plot (http://en.wikipedia.org/wiki/Q-Q_plot) for a uniform distribution (http://en.wikipedia.org/wiki/Uniform_distribution_%28continuous%29): when all values generated by the random process (here: CCD values) are equally likely within a certain range, the plot will be close to a straight line. For distributions where values at the extremes become less likely (e.g. a Poisson or normal distribution), the curve will always look more or less like an inverted S.
Basically, finding "outliers" always means that you have some idea of how the data should be distributed, and that you filter out the samples that do not fit under this assumption. I think that for CCD data, assuming a Poisson distribution (http://en.wikipedia.org/wiki/Poisson_distribution) is a valid approach. So, to further refine this method, it may be a good idea to use a Q-Q plot for a Poisson distribution instead of one for a uniform distribution.
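As a sketch of that idea (untested, just to make it concrete; the function name and the use of SciPy are mine): instead of plotting the sorted stack against the sample index, plot it against the theoretical Poisson quantiles and flag the points that stray from the resulting Q-Q line:

    import numpy as np
    from scipy.stats import poisson

    def poisson_qq_outliers(stack, mean_counts, k=3.0):
        # stack: 1-D array of raw pixel counts at the same coordinates.
        # mean_counts: expected count level, e.g. the stack median.
        # Returns a boolean mask (in sorted order) of suspected outliers.
        y = np.sort(stack)
        n = y.size
        # Theoretical Poisson quantiles at the plotting positions (i + 0.5) / n.
        probs = (np.arange(n) + 0.5) / n
        q = poisson.ppf(probs, mean_counts).astype(float)
        # Fit the Q-Q line and flag points deviating by more than k average deviations.
        a, b = np.polyfit(q, y, deg=1)
        residuals = y - (a * q + b)
        d = np.mean(np.abs(residuals))
        return np.abs(residuals) > k * d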
Just some ideas, not practically tested....
Georg