This is incorrect; it is not about more or less light falling onto the photosites, it's about the frequency of the change (which defines the edges) relative to the sampling frequency. Per the Nyquist theorem, a signal must be sampled at at least twice its highest frequency to be captured properly. Using the example ionicbeam gave above, if a lens resolves 1-micron detail, the sensor needs a pixel pitch of at most 0.5 microns to out-resolve the lens; in other words, at least 2 pixels are needed in the space of 1 micron, as mentioned above.
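A quick sketch of that Nyquist arithmetic, using the 1-micron example above (the function name is just mine for illustration, not any standard optics API):

```python
def max_pixel_pitch_um(lens_resolution_um: float) -> float:
    """The smallest detail the lens projects must be covered by at least
    2 pixels, so the pixel pitch can be at most half the lens resolution."""
    return lens_resolution_um / 2.0

lens_res = 1.0  # lens resolves 1-micron detail, as in the example above
print(f"Lens resolution {lens_res} um -> max pixel pitch {max_pixel_pitch_um(lens_res)} um")
# -> 0.5 um, i.e. at least 2 pixels per micron
```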
Let's say that with a 12MP full frame sensor, a lens out-resolves the sensor by a factor of 4 in pixel-count terms. In this case you get the full resolution from the sensor, but the lens has more to offer, and can still be sampled properly even when the sensor's linear resolution is doubled. Now consider a 48MP full frame sensor, which has 4x the pixels and therefore sqrt(4) = 2x the linear resolution of the 12MP one. In this case the resolving power of the lens and the sensor match each other and the pairing is at its optimum. Once the sensor goes beyond 48MP, the sensor will out-resolve the lens.
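To make the 12MP vs 48MP comparison concrete, here is some back-of-the-envelope arithmetic assuming a 36 x 24 mm full frame sensor with square pixels (rounded numbers, not a claim about any specific camera):

```python
import math

SENSOR_W_MM, SENSOR_H_MM = 36.0, 24.0  # assumed full frame dimensions

def pixel_pitch_um(megapixels: float) -> float:
    """Approximate pixel pitch for square pixels filling the sensor area."""
    area_um2 = (SENSOR_W_MM * 1000) * (SENSOR_H_MM * 1000)
    return math.sqrt(area_um2 / (megapixels * 1e6))

for mp in (12, 48):
    print(f"{mp} MP full frame -> pixel pitch ~ {pixel_pitch_um(mp):.2f} um")
# 12 MP -> ~8.49 um, 48 MP -> ~4.24 um

# Linear resolution scales with the square root of the pixel count:
print("Linear resolution gain 12MP -> 48MP:", math.sqrt(48 / 12))  # 2.0
```

So quadrupling the pixel count halves the pixel pitch, which is exactly the 2x linear resolution step described above.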
Does going from 12MP to 48MP result in a "lower detailed pic"? No. Instead of being limited by the sensor's resolving power, lens and sensor are now matched and you get the maximum possible resolution from both. Does going from 12MP to more than 48MP result in a "lower detailed pic"? Again no. The limiting factor is now the lens, and yes, the resolution is limited by it, but you still get the maximum resolution the lens can deliver and the detail will not be lower (in fact it is higher than at 12MP). You're just not fully utilizing the sensor's resolving power in this case, and of course if you upgrade to a lens with higher resolving power you'll reap the benefits of the sensor.
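A toy model of that "limiting factor" argument, treating delivered detail as simply the lower of what the lens and the sensor can each resolve (real systems combine MTFs, so this is only an illustration; the 48MP lens limit is just the hypothetical number from above):

```python
def delivered_detail_mp(lens_limit_mp: float, sensor_mp: float) -> float:
    """Detail you actually get is capped by whichever part resolves less."""
    return min(lens_limit_mp, sensor_mp)

LENS_LIMIT_MP = 48  # hypothetical: the lens "runs out" at ~48MP-equivalent detail

for sensor_mp in (12, 48, 96):
    print(f"{sensor_mp} MP sensor -> ~{delivered_detail_mp(LENS_LIMIT_MP, sensor_mp)} MP of detail")
# Detail never goes down as sensor resolution goes up; it just plateaus
# once the lens becomes the limiting factor.
```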