It is relatively easy to consider this sort of translation within the limited confines of classical music. The challenge magnifies when we consider something as familiar as the expressiveness of the human voice. The human voice in song (think of Joni Mitchell, for example) expresses itself through a tight integration of pitch, pitch inflection, vibrato, pure tone and sudden timbral grit, playful sounding of words, breath, and a hint of laughter, all colouring the words themselves to express emotional states of great depth and range. The power of this integration of separate features, within what is generally considered simple and straightforward music, poses overwhelming challenges to translation.
Part of this power comes from the large share of our auditory perception that is dedicated to perceiving and interpreting the human voice. A similarly large portion of visual perception is dedicated to the recognition and interpretation of expressions on human faces. For this reason, it is tempting to suggest using facial expression to carry some of the features of vocal inflection. Certainly, one can gather a lot from watching the face of a singer. But I feel fairly certain that systematic algorithmic translation of vocal inflection into synthetic facial gestures would not be able to carry the same depth of expression, and could well be very irritating, if not repulsive!