Turning a photo into music

Once upon a time, not so long ago, I found myself wanting to create some music, and to try something new while doing it. I downloaded a bunch of programs from the 'net, and began experimenting. Some of the programs weren't worth the time they took to download, some were just so cool I spent hours and hours playing with them, but none of them were giving me exactly what I was looking for.

I did come across one which purported to make music from images, and it did...sort of...if you don't mind a lot of squeak-squawk-boop-boop-fweep-waaaaaw sounds. I don't mind them - in small doses - but I still wasn't getting anything close to what most people consider music.

MIDImage was created as a result of all that. Luckily, I don't have a life, so I can take the time to write this sort of program. With MIDImage, a number of parameters are set, then the program examines the image and produces a MIDI file from it. What could be simpler? (Whether most people would consider the output "music" is still up in the air. It should be noted, however, that MIDImage was not really designed to create finished pieces...it was intended to be just another tool in the musician's toolbox, to be used along with everything else. I use it by itself because I just think it's so cool that photographs can do this. Just bear in mind that I've also been known to sing along with the clothes dryer, and am often fascinated by shiny objects.)

[Image: Li: original photo]

Anyway, as an example, here's an image I want to turn into a song. (Okay...I admit it...almost all of the music I've done lately has been derived from photos of beautiful women. [1] I just can't seem to get inspired by a photo of a Golden Retriever or a sailboat. Hopefully, that particular problem is incurable.)

So, first thing is to decide what instruments I want in the piece. I'll use SoundFonts, rather than just go with the built-in instruments of the soundcard. They sound better, usually more realistic, and also make available a wider range of instruments.

Decisions, decisions. Should it be a Linear or a Banded scan? [2] A Linear scan will start at the top-left corner of the image and scan line-by-line down to the bottom. A Banded scan will give each track a horizontal section of the image, then scan each section from left to right. There is also the Random scan option, of course, but I hardly ever use it.
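For the curious, the difference between the two scan orders can be sketched in a few lines of Python. This is only an illustration, not MIDImage's actual code: the function names are made up, and I'm assuming that a band is read row by row and that any leftover rows at the bottom of the image are simply ignored.

```python
def linear_scan(width, height):
    """Yield (x, y) from the top-left corner, line by line, down to the bottom."""
    for y in range(height):
        for x in range(width):
            yield (x, y)


def banded_scan(width, height, tracks):
    """Return one list of (x, y) coordinates per track.

    Each track gets its own horizontal band of the image.
    Assumptions: each band is read row by row, left to right, and rows
    that do not divide evenly among the tracks are dropped.
    """
    band_height = height // tracks
    bands = []
    for t in range(tracks):
        top = t * band_height
        band = [(x, y)
                for y in range(top, top + band_height)
                for x in range(width)]
        bands.append(band)
    return bands
```

With a Linear scan there is one long stream of pixels for the whole image; with a Banded scan, every track only ever sees its own strip.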

The photo doesn't really look like it should be a Linear scan, but...I like repeating patterns and themes, and Linear scans do an admirable job of it.

Next, the chosen instruments are associated with the tracks. 15 tracks is usually a bit too many, but what the heck. I also remap around Track 10, which defaults to the built-in percussion instruments, so "Track 10" is really Track 11, "Track 11" is really Track 12, and so on up to "Track 15", which maps to Track 16.
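The renumbering is easier to see in code than in words. A minimal sketch, with a made-up function name (this is not taken from MIDImage itself):

```python
def midi_track_for(slot):
    """Map an instrument slot (1-15) to a MIDI track, skipping percussion Track 10."""
    if not 1 <= slot <= 15:
        raise ValueError("slot must be between 1 and 15")
    # Slots 1-9 keep their number; slots 10-15 shift up by one.
    return slot if slot < 10 else slot + 1
```

So slot 9 stays on Track 9, slot 10 lands on Track 11, and slot 15 lands on Track 16.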

Just to keep things almost simple, I'll use a Key Signature of C Phrygian. This means the only notes that will sound are C C# D# F G G# & A#, assuming any of the pixels map to those notes. Photographs usually have a wide range of color-values, so there shouldn't be any problems as far as that is concerned. For now, I only use the Red component of the pixels. If need be, I could look at the Green or Blue components, or even combinations of the three.
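Here is a rough sketch of that filtering step. The names are hypothetical, and the way a 0-255 Red value becomes a 0-127 MIDI note (simply halving it) is my assumption for the example, not necessarily how MIDImage does the conversion.

```python
# C C# D# F G G# A# expressed as pitch classes (semitones above C).
C_PHRYGIAN = {0, 1, 3, 5, 7, 8, 10}


def in_key(midi_note, pitch_classes=C_PHRYGIAN):
    """True if the MIDI note belongs to the chosen set of notes."""
    return midi_note % 12 in pitch_classes


def red_to_note(pixel):
    """Turn an (r, g, b) pixel into a MIDI note, or None if it is out of key.

    Assumption: the Red value (0-255) is halved to get a note number (0-127).
    """
    red = pixel[0]
    note = red // 2
    return note if in_key(note) else None
```

A pixel with a Red value of 120 becomes note 60 (Middle C) and is kept; a Red value of 124 becomes note 62 (a D), which is not in C Phrygian, so it produces no note.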

[Image: Li: 15 bands]

The photo is too large, even using 15 Tracks, if every pixel is scanned and converted to a MIDI note. I'm looking for a result of about 3 minutes, so a little experimentation is needed.

It turns out that only using every third pixel gives me what I want. Those vertical lines in the image bands are indicators of which pixels are being scanned, and which are being ignored. You can probably see the skip-two-pixels, read-a-pixel pattern.
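The skip-two, read-one pattern amounts to nothing more than a stride of three. A tiny sketch (the function names are invented for the example):

```python
def sample_every(pixels, step=3):
    """Skip (step - 1) pixels, read one, and repeat."""
    return pixels[step - 1::step]


def scanned_count(total_pixels, step=3):
    """How many pixels are actually read when only every step-th one is used."""
    return total_pixels // step
```

Reading every third pixel cuts the number of candidate notes to a third, which is what brings the length of the piece down.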

With a properly designed SoundFont, each instrument defaults to its full useable range of notes, but that is usually much too wide a range. A little trimming is probably a good idea. And, since some of the instruments are duplicated, I'll split the ranges into octaves.

With C4 being Middle-C, the pixels will be mapped to fall into the following ranges:

           Track 1 C1 - C4     Track 6  C3 - C5    Track 11 C4 - C7
           Track 2 C2 - C#7    Track 7  C1 - C3    Track 12 C2 - C5
           Track 3 E2 - C4     Track 8  C2 - C6    Track 13 C3 - C6
           Track 4 C3 - C6     Track 9  C2 - C5    Track 14 C4 - C5
           Track 5 C2 - C3     Track 10 C3 - C6    Track 15 C2 - C6
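To check whether a given MIDI note falls inside one of these ranges, the note names first have to become numbers. A small sketch, using the C4 = 60 convention mentioned above; the helper names are made up, and only sharps are handled because that is all the table uses.

```python
# Semitone offsets of the natural notes above C.
NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def note_number(name):
    """Convert a name such as 'C4' or 'C#7' to a MIDI note number (C4 = 60)."""
    letter = name[0].upper()
    sharp = 1 if name[1:2] == "#" else 0
    octave = int(name[1 + sharp:])
    return 12 * (octave + 1) + NOTE_OFFSETS[letter] + sharp


def in_range(note, low, high):
    """True if the MIDI note lies between the two named notes, inclusive."""
    return note_number(low) <= note <= note_number(high)
```

For Track 1 (C1 - C4), Middle C itself is the highest note allowed: `in_range(60, "C1", "C4")` is true, while note 61 is already out of range.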

Dynamics also play an important part in all this. If every note sounds with the same volume, it can become very boring...or annoying, take your pick. Setting a minimum and maximum volume for each track, with the individual note-volume dependent on the color of the pixel, makes for a much nicer effect. For instance, a dark Red pixel will sound quieter than a bright Red pixel. (Velocity is what "Volume" is called in the world of MIDI. Most people would just say "volume" and be done with it, but the official term is Velocity. There is a "volume" in MIDI as well, but that usually refers to the overall loudness of a whole track or piece, rather than of a single note.)

And, while we're at it, why not control the volume even further by adjusting the level over different parts of the song? What the heck. I'll start with the Bass, then let the other instruments sneak in bit by bit.

Track 1 (the Acoustic Bass) will use only the color-value of the pixel to determine the volume, staying within a range of 40 to 70 out of a possible 127 (127 being the loudest).
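One simple way to squeeze a 0-255 color value into that 40-to-70 window is a straight linear scale. This is only a sketch with a hypothetical function name; I'm assuming a linear mapping for the sake of the example, and the real curve may differ.

```python
def velocity_from_color(value, vmin=40, vmax=70):
    """Scale a color value (0-255) linearly into the range vmin..vmax."""
    if not 0 <= value <= 255:
        raise ValueError("color value must be between 0 and 255")
    return vmin + round(value * (vmax - vmin) / 255)
```

The darkest pixel (0) gives a velocity of 40, the brightest (255) gives 70, and everything else lands somewhere in between.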

Tracks 2 and 8 through 15 will start at zero volume, then steadily increase in volume as the song goes along.

Tracks 3 through 7 will increase in volume one after the other, starting when 13% of the song has passed, each one coming in after an additional 10% or so has played.

If that's not confusing enough, the base volume is still determined by the color of the pixel. The additional control just determines what percentage of the "raw" volume will be used. If the "raw" volume, based on the pixel color, is 60, but the additional control is at 30% at that point in the song, the volume will be set at 18 for that note.
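The arithmetic is just a percentage applied to the raw velocity. In the sketch below, the function names are invented, and the straight-line fade is an assumption based on "steadily increase"; where each fade ends is also assumed.

```python
def fade_in_percent(position, start, end=1.0):
    """Percentage of the raw volume to use at a point in the song.

    position, start and end are fractions of the song (0.0 to 1.0).
    Assumption: the level rises in a straight line from 0% at `start`
    to 100% at `end`.
    """
    if position <= start:
        return 0.0
    if position >= end:
        return 100.0
    return 100.0 * (position - start) / (end - start)


def scaled_velocity(raw, percent):
    """Apply the additional volume control to the pixel-based velocity."""
    return round(raw * percent / 100)
```

With a raw velocity of 60 and the control at 30%, `scaled_velocity(60, 30)` gives 18, matching the example above.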

To add some more flavoring, some tracks will tie the same note/colors together, some will tie notes and non-notes together (instead of inserting silence when a note falls outside the Key Signature or octave range), and some will increase or decrease the length of the notes, depending on the color-value. A nice assortment of base duration values will also be used.
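The two kinds of tying can be sketched like this. Both functions are illustrations with made-up names, not MIDImage's own logic; `None` stands for a pixel that did not produce a usable note, and every pixel is assumed to start out with the same base duration.

```python
def tie_repeats(notes, base_duration=1):
    """Merge runs of the same note into one longer (note, duration) pair."""
    tied = []
    for note in notes:
        if tied and tied[-1][0] == note:
            tied[-1] = (note, tied[-1][1] + base_duration)
        else:
            tied.append((note, base_duration))
    return tied


def tie_through_gaps(notes, base_duration=1):
    """Stretch the previous note over non-notes instead of inserting silence."""
    tied = []
    for note in notes:
        if note is None and tied:
            prev, duration = tied[-1]
            tied[-1] = (prev, duration + base_duration)
        else:
            tied.append((note, base_duration))
    return tied
```

With the first approach, `[60, 60, 61, None, None]` becomes a double-length 60, a single 61, and a double-length rest. With the second, `[60, None, None, 61]` becomes a triple-length 60 followed by a single 61, with no silence at all.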

That's pretty much it. Here is the result. (In MP3 format, since the chances of you having the same SoundFonts are about the same as me winning the lottery...) And, if you're interested, here are the actual settings used: Settings. If you click on the MIDI link, you'll hear what it would sound like with the "wrong" SoundFonts (or the default sounds of your MIDI player).

Looking back at what I've just written, it sounds more complicated than it really is. It just takes a bit of practice, a bit of experimenting to get a feel for it. It can even get to the point where you can achieve an approximation of the sound you want without too many false starts, just leaving a little fine-tuning to be done.

More of this music can be found here, if you're interested.


Notes

1 The beautiful woman in this case is Lisa Dalbello, a wonderfully talented songwriter/singer from Canada.

2 A digitized image is made up of a gazillion tiny dots called pixels. Each pixel can have a Red value of 0 to 255, a Blue value from 0 to 255, and a Green value from 0 to 255.

MIDI notes can have a range from 0 to 127 (with 60 usually being "Middle C"). MIDImage scans the pixels, determines which value to use, then finds the corresponding MIDI note. If it falls within the designated Key Signature and note/octave range, it is used. If it's outside the desired set of notes, it is discarded (unless MIDImage has been configured to tie unused notes to used notes).

The terms scan and scanning above refer to examining a pixel, determining its Red, Green, and/or Blue value, then mapping that value to a MIDI note. It had to be called something, and "scanning" seemed like a good choice.