Forget about the acquisition for a moment.
Now you have an RGB image. This means there are three coordinate values, each indicating an intensity for a given color. This is what is called a color space. You can think of each RGB channel as an orthogonal coordinate, so every "color" lies inside a cube with sides of length 1 (for a normalized range). OK. But you may remember the old black-and-white televisions, or your vision at night: you don't see colors, just grayscale intensities. How is a 3-dimensional value, the color, translated into a single-dimensional value, grayscale? You take a weighted combination of the R, G and B values. In the cube of the RGB color space, this grayscale is a straight line: the diagonal running from black (0,0,0) to white (1,1,1).
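As a minimal sketch of that weighted combination, here is one common choice of weights (the Rec. 601 luma coefficients; other standards such as Rec. 709 use different proportions, and the function name is just for illustration):

```python
import numpy as np

# Rec. 601 luma weights: green dominates because the eye is most
# sensitive to it, blue contributes the least.
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114])

def rgb_to_gray(rgb):
    """Collapse a normalized (H, W, 3) RGB image to (H, W) grayscale."""
    return rgb @ LUMA_WEIGHTS

# Example: a pure-green pixel maps to 0.587 gray.
pixel = np.array([[[0.0, 1.0, 0.0]]])
print(rgb_to_gray(pixel))  # [[0.587]]
```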
Right now we have a way to define the luminance of any image. What is the chrominance? Remember the cube: any "color" may be described by 3 independent R, G and B values. If we know the luminance associated with that color, and take that information as a new coordinate, we only need 2 other coordinates to reach the original point. Think about vectors: the chrominance is what the luminance is missing to become a "color". It is just the hue and saturation information. In fact, this is one way to establish the chrominance coordinates: hue is the angle around the luminance axis (where to point), and saturation is how far from the luminance line the color lies. There are many other ways to set those two values, hence the variety of color spaces: Lab, Lch... and different ways to define the "luminance value" itself: HSV, HSI, etc.
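To make the geometry concrete, here is a sketch that decomposes a single RGB vector exactly this way: project it onto the gray diagonal to get the luminance, then read the perpendicular residual as an angle (hue) and a distance (saturation). The basis vectors `u` and `v` are one arbitrary choice of axes in the plane perpendicular to the diagonal; real color spaces pick their axes differently.

```python
import numpy as np

def luma_chroma(rgb):
    """Split one RGB vector into luminance (position along the gray
    diagonal) plus chrominance (hue angle and saturation distance
    measured around that axis)."""
    gray_axis = np.array([1.0, 1.0, 1.0]) / np.sqrt(3)
    luminance = rgb @ gray_axis              # projection onto the diagonal
    residual = rgb - luminance * gray_axis   # component perpendicular to it
    saturation = np.linalg.norm(residual)    # distance from the gray line
    # Express the residual in a 2-D basis of the plane perpendicular
    # to the gray axis, then take its angle as the hue.
    u = np.array([1.0, -1.0, 0.0]) / np.sqrt(2)
    v = np.array([1.0, 1.0, -2.0]) / np.sqrt(6)
    hue = np.arctan2(residual @ v, residual @ u)
    return luminance, hue, saturation

# A mid-gray pixel has zero saturation; a pure red pixel does not.
print(luma_chroma(np.array([0.5, 0.5, 0.5])))  # (0.866, 0.0, 0.0)
print(luma_chroma(np.array([1.0, 0.0, 0.0])))  # (0.577, 0.524, 0.816)
```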
So, this has nothing to do with the capture method. If you capture L, R, G and B data, you are replacing the luminance derived from the RGB filters with the separately captured luminance. The idea behind this method is that the human eye is not as good at discriminating chrominance as luminance, so you can use 2x2 binning for the RGB data and use its chrominance to complement the captured full-resolution luminance.
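A minimal sketch of that combination, assuming the chrominance is carried as per-channel ratios to the RGB luminance and upsampled by naive pixel replication (the function names and the upsampling choice are illustrative; real stacking software does this far more carefully):

```python
import numpy as np

def bin2x2(img):
    """Average 2x2 blocks, simulating binned (half-resolution) RGB data."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    return img[:h, :w].reshape(h // 2, 2, w // 2, 2, -1).mean(axis=(1, 3))

def lrgb_combine(L, rgb_binned):
    """Replace the binned RGB's luminance with the full-resolution L.
    The per-channel ratio to luminance carries the hue/saturation."""
    luma = rgb_binned @ np.array([0.299, 0.587, 0.114])
    chroma = rgb_binned / np.maximum(luma, 1e-6)[..., None]
    chroma_up = chroma.repeat(2, axis=0).repeat(2, axis=1)  # back to L's scale
    h, w = L.shape
    return chroma_up[:h, :w] * L[..., None]

# Usage: full-resolution L frame plus a binned RGB frame.
L = np.random.rand(100, 100)
rgb = bin2x2(np.random.rand(100, 100, 3))
result = lrgb_combine(L, rgb)  # (100, 100, 3) LRGB composite
```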