How to Automate Piano Music Transcription With Transformers


Piano music transcription, the process of turning an audio recording of a musical performance into a sheet music score, is a challenging task. It is one of the core tasks of Music Information Retrieval (MIR), and despite significant advances in the field over the years, it remains difficult to automate. The challenge lies in converting a raw audio recording into a series of note events with precise onset/offset timings and velocities, and then mapping that continuous stream onto a metrical grid.
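As a concrete sketch, the note events described above can be represented with a small data structure. The class and field names here are illustrative, not taken from any particular transcription library.

```python
from dataclasses import dataclass

# Hypothetical minimal note-event representation: pitch, timing, and
# velocity, the core attributes a transcription system must recover.
@dataclass
class NoteEvent:
    pitch: int        # MIDI pitch number (piano range is 21-108)
    onset: float      # onset time, in seconds
    offset: float     # offset time, in seconds
    velocity: int     # MIDI velocity (0-127)

    @property
    def duration(self) -> float:
        return self.offset - self.onset

# Middle C (MIDI 60) held for half a second at moderate loudness:
note = NoteEvent(pitch=60, onset=1.25, offset=1.75, velocity=80)
print(note.duration)  # 0.5
```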

Traditionally, this has been done by hand, but growing interest in automated transcription has driven rapid development of machine learning approaches to the problem. Among these, the most successful have been those based on deep neural network architectures.


These typically use separate model stacks for detecting piano note onsets and for detecting the continued presence of each note across successive spectrogram frames. A number of these models also incorporate extra features such as the detection of pedal events and the modeling of the dynamic range (velocity) of a given piano performance.
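The onset/frame decomposition can be illustrated with a toy decoding routine in the spirit of onsets-and-frames-style models: a note starts when the onset head fires and ends when the frame head falls below threshold. The probabilities and threshold below are toy values, not real model output.

```python
# Decode (onset_s, offset_s) pairs for a single pitch from per-frame
# onset and frame probabilities. Toy sketch, not a real model's decoder.
def decode_notes(onset_probs, frame_probs, threshold=0.5, frame_rate=100.0):
    notes, active, start = [], False, 0
    for t, frame_p in enumerate(frame_probs):
        if not active and onset_probs[t] > threshold:
            active, start = True, t            # onset head fired: note begins
        elif active and frame_p <= threshold:
            notes.append((start / frame_rate, t / frame_rate))
            active = False                     # frame head dropped: note ends
    if active:  # note still sounding at the end of the clip
        notes.append((start / frame_rate, len(frame_probs) / frame_rate))
    return notes

# Toy posteriograms: one note sounding from frame 2 through frame 5.
onset_probs = [0.1, 0.2, 0.9, 0.1, 0.1, 0.1, 0.1, 0.0]
frame_probs = [0.1, 0.2, 0.9, 0.8, 0.8, 0.7, 0.2, 0.1]
print(decode_notes(onset_probs, frame_probs))  # [(0.02, 0.06)]
```

Separating the two heads matters because onsets are sharp, transient events while sustained frames decay slowly; a single detector tends to blur repeated notes together.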


The first step in this process is to estimate the pitch of each note in the recording. A simple approach analyzes the audio spectrum and picks out its strongest frequency components; the fundamental frequency of the dominant harmonic series is then taken as the pitch of that note. In polyphonic piano music this is complicated by the overlapping harmonics of simultaneous notes, which is a major reason learned models now dominate.

Once the pitch of a note is determined, its duration can be calculated as the time between its onset and its offset in the analyzed sound clip. This duration is then quantized onto a standard music-notation time scale, from which a corresponding music score can be generated.
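Mapping a measured duration onto a notation time scale amounts to snapping it to a metrical grid, assuming the tempo is known. The sixteenth-note grid resolution below is an illustrative choice, not a fixed convention.

```python
# Snap a measured note duration (in seconds) onto a metrical grid.
def quantize_duration(seconds, tempo_bpm, divisions_per_beat=4):
    beat = 60.0 / tempo_bpm                # length of one beat, in seconds
    grid = beat / divisions_per_beat       # grid step (a sixteenth note here)
    steps = max(1, round(seconds / grid))  # never quantize a note to zero length
    return steps * grid

# At 120 BPM, a measured 0.26 s note snaps to 0.25 s (an eighth note):
print(quantize_duration(0.26, 120))  # 0.25
```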

Whether transcribing by hand or checking a model's output, this step requires careful listening to the audio to form a clear picture of what is being played. A DAW is useful for playback while transcribing, since keyboard shortcuts let you move efficiently through the audio file.

To improve the performance of music transcription systems, it is necessary to consider how a model can adapt to changes in the musical context it is attempting to capture. For example, if the music being transcribed is performed on a different type of instrument than the one used to train the system, the system should be able to adjust its behavior to account for the difference in acoustic properties between the instruments.

Several studies have explored this issue by modifying the basic Transformer architecture to make it more flexible in supporting new types of music transcription tasks. Specifically, these studies have added a "pitch tensor" representing the pitch of each note in the MIDI file, and have modified the encoding of note events to take into account the varying lengths of MIDI notes. With this approach, the system can generate musical phrases that remain coherent over a longer time span than would be achievable with a vanilla Transformer or with state-of-the-art LSTM-based music transcription systems.
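The event-based sequences such models emit can be illustrated with a toy MIDI-like token vocabulary. The token layout below is hypothetical and far simpler than the representations used in the studies above; it only shows the general idea of serializing note events for a sequence model.

```python
# Toy MIDI-like event vocabulary for a sequence model (illustrative layout):
NOTE_ON_BASE = 0        # tokens 0-127:   note-on, one per MIDI pitch
NOTE_OFF_BASE = 128     # tokens 128-255: note-off, one per MIDI pitch
TIME_SHIFT_BASE = 256   # tokens 256+:    time shift, in 10 ms steps

def encode(events):
    """events: list of (time_s, 'on'|'off', pitch), sorted by time."""
    tokens, clock = [], 0.0
    for time_s, kind, pitch in events:
        shift = int(round((time_s - clock) * 100))  # 10 ms resolution
        if shift > 0:
            tokens.append(TIME_SHIFT_BASE + shift)  # advance the clock
            clock += shift / 100.0
        base = NOTE_ON_BASE if kind == "on" else NOTE_OFF_BASE
        tokens.append(base + pitch)
    return tokens

# Middle C held for 0.5 s starting at t = 0:
print(encode([(0.0, "on", 60), (0.5, "off", 60)]))  # [60, 306, 188]
```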