For some restrictive budgetary reasons, and for some artistic reasons, my audio timeline is very simple.
What are those artistic reasons?
As sound guys you're probably thinking this sounds terrible, but let's say you take my word for it that it sounds passable to the general audience.
With all due respect, why would I take your word for it? I've been doing sound professionally as a speciality for 20 years, and what you are saying goes against everything I've read on the subject (which is a lot!), everything I've learned (which is also quite a lot) and everything I've experienced (which is more than a fair bit). It also goes against what I've heard from professional commercial filmmakers and from sound professionals who have been in the business longer than me, not to mention all the commercial films I've seen.
What you have described certainly would not be passable to a general audience, with maybe a few very rare exceptions, so rare in fact that I can't think of any examples. So I'm sorry, but not only can I not take your word for it, what you have described appears so ridiculous that it piques my interest as to how you could come to believe it?! Of course, all this depends on what you mean by "passable". Do you mean that a general audience can hear and understand the dialogue, or do you mean that a general audience would actually be engaged by and interested in the story you are trying to tell?
Also, are you sure you would mix your source/diegetic music in stereo? To me it seems like mono is the way to go because it's only coming from one location within the frame of the scene, the audience and characters are not standing in between speakers or wearing headphones...
You are thinking like a photographer, NOT like a filmmaker!! What is "within the frame of the scene" is (or should be!) only a part of the story. As Alcove essentially asked, what is happening beyond the frame of your scene, and where is the room your characters are in?
We start hearing before we are born and we don't stop until we are dead. All the aural information received by the ear is fed into the brain and used to construct our perception of the location we are in. There is far too much aural, visual and other sensory information for our brain to process in a way which allows us to be consciously aware of it all. So most of what we hear we are only aware of subconsciously, or indirectly as reflected in the simplified perception of the world we experience. This could lead the photographer to assume, when making a film, that we can simply ignore all the sound which we would not be consciously aware of (if the scene were real), but this is a gross misunderstanding of both hearing and of perception in general.
What we are hearing subconsciously forms an integral and important part of our general perception. Therefore any omissions or small changes, any sounds which are unusual, out of context or unexpected, will affect our perception and will cause us to question it in order to find a rational explanation. If there is no believable explanation (within our story world) then the rational explanation is that we are not experiencing any sort of reality but are just watching a film! The audience is instantly detached and disengaged from our story, even though they probably don't consciously know why.
Because lo/no/micro budget filmmakers don't understand all the implications of the above, or don't have the knowledge or equipment to act on them, and because they don't have the budget to employ a good, experienced sound designer who does, the result is virtually always to some greater or lesser degree un-involving and un-engaging. This is why audiences in general find lo/no budget indie films slow, boring and uninteresting, and this is why it's so impossibly difficult to make a commercially successful lo/no budget film: not only do you have to actually make an involving and engaging film, but you've also got to overcome the prejudice that lo/no budget indie films are always boring. It does not matter one iota how good your camera, cinematography or script is; if you don't know how to involve and engage the audience in your story, then the best you can ever hope to achieve is some praise and respect from other photographers!!
All this might appear to be esoteric nonsense, but let me put it into context by using your own example, which demonstrates perfectly that you are thinking about sound in overly simplistic terms: purely in terms of your conscious perception of sound rather than in terms of what you are actually hearing and how that affects your overall perception. To simplify the argument, let's say the diegetic music in your scene is from a mono sound source. What would you actually be hearing (as opposed to perceiving)? You would be hearing a certain, relatively small, amount of the actual direct sound from the speaker, but the majority of what you hear would actually be the reflections of that sound from the walls, ceiling, floor and all the other reflective surfaces of the room. The majority of the music you hear would therefore be coming from all around you! This is quite different to your perception though, where all you would consciously hear is the direct sound from the sound system. This mass of reflected sound is not discarded by your brain though; it's used to create the perception of the environment. This has two important consequences for the filmmaker:
1. These masses of reflections are ALWAYS present in the real world and your brain knows this (even if you aren't consciously aware of it), so if these reflections are not present in your sound mix, the only rational conclusion your brain can arrive at is that this scene cannot be ANY version of reality. You've disengaged your audience! And
2. The human brain does far more than register whether sound reflections are present or absent. Blindfold someone and take them into a room, even a virtually silent room, and they will be able to tell you what sort of room it is: a cathedral, a toilet, a sitting room, a basement, etc. They probably won't be able to describe how they know what sort of room they are in; they'll just feel it and know. With a bit more time and concentration they might even be able to tell you more detail about the room: rough dimensions, construction materials and other details. The brain gets all this information from subconscious processes which analyse and interpret a range of time and frequency properties, both between individual reflections and between those reflections and the original sound source. The brain doesn't just discard all this analysis and data once we take off the blindfold; it is integrated into our overall perception, which now includes our vision. So for the filmmaker, just adding any old reflections won't do, because if you add a reverb with the reflection characteristics of a toilet to a scene set in a sitting room, you are setting up a conflict between the visual and aural data and the audience's brains cannot create a viable perception. Without a viable perception, your film has now become at best surreal or abstract but more likely just un-involving and boring, a long way in my book from "passable"!
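To put rough numbers on the two points above, here is a minimal Python sketch. It is not a measurement and not how a mix would actually be built; it uses the standard diffuse-field approximation for critical distance (the distance at which direct and reverberant energy are equal), roughly 0.057 * sqrt(V / RT60) for an omnidirectional source. The room volumes, reverb times and listening distances are illustrative guesses of mine, and the function names are made up for this example. The diffuse-field model is also crude for small rooms, so treat the output as an order-of-magnitude indication only.

```python
import math


def critical_distance(volume_m3, rt60_s, directivity=1.0):
    """Distance in metres at which direct and reverberant energy are equal.

    Uses the common diffuse-field approximation 0.057 * sqrt(Q * V / RT60).
    """
    return 0.057 * math.sqrt(directivity * volume_m3 / rt60_s)


def reflected_fraction(distance_m, volume_m3, rt60_s):
    """Approximate share of the received energy that arrives as reflections."""
    dc = critical_distance(volume_m3, rt60_s)
    # Direct energy falls with the square of distance; the reverberant
    # field is treated as uniform throughout the room.
    direct_to_reverb = (dc / distance_m) ** 2
    return 1.0 / (1.0 + direct_to_reverb)


# Hypothetical rooms: (volume in m^3, RT60 in seconds, listening distance in m).
# All three sets of figures are assumptions chosen for illustration.
ROOMS = {
    "sitting room": (50.0, 0.5, 3.0),
    "tiled toilet": (8.0, 1.0, 1.0),
    "cathedral": (50000.0, 6.0, 20.0),
}

for name, (volume, rt60, distance) in ROOMS.items():
    dc = critical_distance(volume, rt60)
    share = reflected_fraction(distance, volume, rt60)
    print(f"{name}: critical distance {dc:.2f} m, "
          f"reflected share at {distance:.0f} m is {share:.0%}")
```

Under these assumptions the listener in each room is well beyond the critical distance, so most of the energy reaching them is reflected rather than direct, and each room has a very different reverb time. That is the conflict you create if you put a toilet's reverb on a sitting room scene.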
In this example I've just dealt with one sound source in an environment, but of course there is never just one sound source in any real environment. Our sitting room is in an apartment or house, and that house is situated in some type of city or rural location, all of which have their own range of sounds, all of which our brains hear, process and build into our final perception, even if they are too quiet for us to be consciously aware of them. Take those sounds away though, and suddenly we are effectively trying to create a cinematic experience which our audience cannot relate to! Which brings us back to un-involving and boring.
If you've understood the principles I'm trying to explain, you'll realise that they don't just apply to sitting room dramas but just as much to fantastical sci-fi and animated films and pretty much all film. Hopefully you've also got a deeper understanding of what Alcove and I mean by "sound is half the experience" and now understand that most of your questions in this thread are nonsensical and could not really be answered even if they were hypothetical.
G