Discovering Music Relations with Sequential Attention

Junyan Jiang¹, Gus G. Xia¹, Taylor Berg-Kirkpatrick²
¹MusicX Lab, NYU Shanghai
²University of California San Diego

Code repository | Paper | Video

Notice: If you have trouble with the interactive MIDI players on this page, consider using Chrome, Edge, or Firefox. You can drag the piano roll with a mouse or touch device to change the playback position.

A piece generated by the proposed model. The melody line (purple) is generated given the chords (red). More generated MIDI samples are available here.

1. Introduction

Music is a type of sequential data with various kinds of long-term relations, including repetition, retrograde, sequences, call-and-response, and many more. Modeling such (potentially long-term) relations is crucial in both music analysis and generation tasks.

Attentive models such as Transformers are currently a popular way to capture long-term relations in a sequence. Their core component is the element-wise attention mechanism, which is powerful at capturing token-level similarity but lacks an inductive bias for directly comparing sequences against sequences; capturing sequence-level similarity therefore requires stacking multiple layers.

In this paper, we present the sequential attention module, which directly models sequence-level relations in music. In this module, the keys and queries are no longer individual tokens but sequences.

| Attention module | Element-wise attention | Sequential attention (this paper) |
| --- | --- | --- |
| Type of keys | Token | Sequence |
| Type of queries | Token | Sequence |
| Weighting method | Dot(key, query) | FFN(LSTM(Concat(key, query))) |

2. Module Architecture

To compute sequence-wise similarity, we first stack the key and query sequences together and feed them into a uni-directional Long Short-Term Memory (LSTM) layer. The output of the LSTM, together with the corresponding key token, is used to determine both a matching score between the two sequences and a prediction for the next token of the query sequence.

Fig. 1. The module architecture.

Fig. 1 provides an illustrative scenario with two sequences of notes. The first sequence is C4 D4 E4 C4 G4, and the second is A3 B3 C4 A3 ?, where the question mark denotes an unknown token we want to predict. Notice that these two sequences are likely to form a tonal sequence relation, and we can use this information to predict that the unknown token is likely E4. If the module is well trained, it will discover similar relations and use them to improve prediction accuracy.
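
As a concrete illustration, here is a minimal PyTorch-style sketch of the module described above. The class name, dimensions, and the exact way the key and query are stacked (element-wise concatenation of token embeddings) are our illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch of the sequential attention scoring module (assumed names/shapes).
import torch
import torch.nn as nn

class SequentialAttentionModule(nn.Module):
    def __init__(self, emb_dim: int, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Uni-directional LSTM over the stacked (key, query) sequence.
        self.lstm = nn.LSTM(2 * emb_dim, hidden_dim, batch_first=True)
        # FFN heads on top of the LSTM output and the aligned key token.
        self.score_head = nn.Sequential(
            nn.Linear(hidden_dim + emb_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1))
        self.pred_head = nn.Sequential(
            nn.Linear(hidden_dim + emb_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, vocab_size))

    def forward(self, key_tokens, query_tokens, next_key_token):
        # key_tokens, query_tokens: (batch, length) aligned token ids
        # next_key_token: (batch,) key token aligned with the position to predict
        k = self.embed(key_tokens)
        q = self.embed(query_tokens)
        h, _ = self.lstm(torch.cat([k, q], dim=-1))   # stack element-wise, run LSTM
        feats = torch.cat([h[:, -1], self.embed(next_key_token)], dim=-1)
        score = self.score_head(feats).squeeze(-1)    # matching score of the two sequences
        logits = self.pred_head(feats)                # prediction for the query's next token
        return score, logits
```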

Fig. 2. Self-attentive layer using sequential attention modules.

We can use this module in an attentive language model (Fig. 2). To predict the next token of a partial sequence, we regard its suffix as the query string and its substrings as the key strings. Notice that some key strings do not match the query string well and therefore provide little useful information for prediction. The normalized matching scores serve as the attention weights, since a higher score indicates a more important relation between the key and query sequences. We aggregate the per-key predictions with a weighted-average layer to produce the final prediction for the next token.
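
The aggregation step can be summarized by a small sketch like the one below. Whether the weighted average is taken over probabilities or over logits is our assumption for illustration.

```python
# Sketch: combine per-key predictions using normalized matching scores as attention weights.
import torch.nn.functional as F

def aggregate_predictions(scores, logits):
    """scores: (num_keys,) matching scores; logits: (num_keys, vocab_size)."""
    weights = F.softmax(scores, dim=0)                  # normalized matching scores
    probs = F.softmax(logits, dim=-1)                   # per-key predicted distributions
    return (weights.unsqueeze(-1) * probs).sum(dim=0)   # weighted average -> final prediction
```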

Fig. 3. A conditional version of the sequential attention module.

For the task of conditional sequence generation (e.g., melody generation given a chord sequence), we propose a conditional version of the sequential attention module (Fig. 3). In this module, the relations between the condition sequences are also considered. Since future conditions are also revealed, the module includes a backward LSTM to capture relations among future conditions.
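
The sketch below only illustrates this idea under our assumptions (module and argument names are hypothetical): a forward LSTM reads the stacked past tokens and conditions as before, while a second LSTM, run over the reversed future conditions, summarizes the information revealed about the future.

```python
# Hedged sketch of the conditional sequential attention module.
import torch
import torch.nn as nn

class ConditionalSequentialAttention(nn.Module):
    def __init__(self, emb_dim, hidden_dim, vocab_size, cond_vocab_size):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, emb_dim)
        self.cond_embed = nn.Embedding(cond_vocab_size, emb_dim)
        self.fwd_lstm = nn.LSTM(4 * emb_dim, hidden_dim, batch_first=True)  # past tokens + conditions
        self.bwd_lstm = nn.LSTM(2 * emb_dim, hidden_dim, batch_first=True)  # future conditions
        self.score_head = nn.Linear(2 * hidden_dim, 1)
        self.pred_head = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, key_toks, query_toks, key_conds, query_conds,
                future_key_conds, future_query_conds):
        # Forward LSTM over the aligned past: key/query tokens and their conditions.
        past = torch.cat([self.tok_embed(key_toks), self.tok_embed(query_toks),
                          self.cond_embed(key_conds), self.cond_embed(query_conds)], dim=-1)
        h_fwd, _ = self.fwd_lstm(past)
        # Backward LSTM over the already-revealed future conditions (reversed in time).
        fut = torch.cat([self.cond_embed(future_key_conds),
                         self.cond_embed(future_query_conds)], dim=-1)
        h_bwd, _ = self.bwd_lstm(torch.flip(fut, dims=[1]))
        feats = torch.cat([h_fwd[:, -1], h_bwd[:, -1]], dim=-1)
        return self.score_head(feats).squeeze(-1), self.pred_head(feats)
```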

3. Results

Fig. 4. Comparative results for accuracy and perplexity on the next-token prediction task (test sets).

Our experiments show that the model outperforms a 3-layer Transformer with relative positional encoding on the next-token prediction task. We also designed case-study examples to show what kinds of relations the module is able to capture. Notice that the top two predictions in case (2) both form valid tonal sequences (in C major and F major, respectively).

Fig. 5. A case study of the module's behavior on different music relations: (1) exact repetition, (2) tonal sequence, and (3) modulating sequence. The question mark is the token to predict and the (s) token is the sustain label. The table shows the top two predictions and their probabilities from the sequential attention model.

Even though the model was not designed with music generation in mind, we ran some music generation experiments with it. In Fig. 6, we use the conditional self-attentive language model to generate the melody given the chords and partial melody notes.

Fig. 6. A generated sample. All chords and the melody for the first 8 bars are given. The model generates the melody for the next 8 bars. The repetitions in the generated piece are highlighted in color (green and red).

4. More Generation Examples

Below we show another example where the model generates the melody from the beginning, given only the chords. Notice that the generated melody contains short-term and long-term repetitions, which occur mainly in the right places (i.e., the melody repeats where the chord sequence repeats).

More generated MIDI samples are available here.

The original MIDI file from the Nottingham dataset (test set, first 16 bars).
A generated piece. The melody (purple) is generated given the chords (red) using the conditional sequential attention language model.

Thanks to Google Creative Lab for the MIDI player.

Published and hosted on GitHub Pages.