Artificial intelligence has been used for synthetic music generation for some time now. However, the watershed moment came when music informatics met AI. Music AI researchers now draw on the latest advances in AI, ML and analytics to build models that create music on par with human composers.

The strength of any AI/ML model is predicated on the data it’s fed. Below, we look at the major music generation datasets doing the rounds in 2022.

NSynth 

NSynth is a large dataset of 305,979 musical notes, each with a unique pitch, timbre and envelope. The notes were collected from 1,006 instruments from commercial sample libraries and are annotated by source (acoustic, electronic or synthetic), instrument family and sonic qualities. Instrument families in the dataset include bass, flute, guitar, keyboard, mallet, organ, reed, string, synth lead and vocal.

MAESTRO

The MAESTRO (MIDI and Audio Edited for Synchronous Tracks and Organisation) dataset has over 200 hours of paired audio and MIDI recordings from ten years of the International Piano-e-Competition. The MIDI data includes key strike velocities and sustain/sostenuto/una corda pedal positions.
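As a rough illustration of what velocities and pedal positions look like at the MIDI level, the sketch below decodes three-byte MIDI channel messages. It relies only on the standard MIDI status bytes and the standard controller numbers for the sustain (64), sostenuto (66) and soft or una corda (67) pedals. The function name `describe_message` and the sample messages are made up for illustration and are not taken from MAESTRO.

```python
# Standard MIDI controller numbers for the three piano pedals.
PEDAL_CONTROLLERS = {64: "sustain", 66: "sostenuto", 67: "una corda"}


def describe_message(status, data1, data2):
    """Decode one 3-byte MIDI channel message into a small dict.

    Hypothetical helper for illustration; not part of any MAESTRO tooling.
    """
    kind = status & 0xF0  # the upper nibble holds the message type
    if kind == 0x90 and data2 > 0:
        # Note-on: data1 is the pitch, data2 is the key strike velocity.
        return {"type": "note_on", "pitch": data1, "velocity": data2}
    if kind == 0x80 or (kind == 0x90 and data2 == 0):
        # Note-off, or a note-on with velocity 0, which also ends a note.
        return {"type": "note_off", "pitch": data1}
    if kind == 0xB0 and data1 in PEDAL_CONTROLLERS:
        # Control change: data2 is the pedal position (0 to 127).
        return {"type": "pedal", "pedal": PEDAL_CONTROLLERS[data1], "position": data2}
    return {"type": "other"}


# Made-up messages: sustain pedal down, a key strike, its release, soft pedal.
messages = [(0xB0, 64, 127), (0x90, 60, 72), (0x80, 60, 0), (0xB0, 67, 40)]
decoded = [describe_message(*message) for message in messages]
```

In practice a MIDI library would handle the file parsing; the point here is only that velocity travels with each note-on message while pedal positions arrive as separate control change messages.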

URMP (University of Rochester Multi-Modal Musical Performance) 

URMP introduced a dataset for facilitating audio-visual analysis of musical performances. The dataset contains several simple multi-instrument musical pieces assembled from separately recorded performances of individual tracks. For each piece, a musical score is provided in MIDI format.

Bach Doodle Dataset

The dataset consists of 21.6 million harmonisations submitted through the Bach Doodle, along with metadata about each composition, such as country of origin and user feedback. It also includes the MIDI of the user-entered melody and the MIDI of the generated harmonisation. An exploration of the melodies in the dataset surfaces the most repeated melodies from each country, or the regional hits.

The Lakh MIDI Dataset v0.1

This dataset contains 176,581 unique MIDI files, 45,129 of which have been matched and aligned to entries in the Million Song Dataset. The dataset mainly facilitates large-scale music information retrieval, both symbolic and audio content-based.

Music21

Music21 contains music performances from 21 categories. It is a set of tools to help scholars quickly find answers to questions like “I wonder how often Bach does that?” or “Which band used these chords for the first time?”, or to learn more about Renaissance counterpoint, Indian ragas, post-tonal pitch structures or the form of minuets.

Datasets for Indian Music

CompMusic catalogues datasets for Indian art music. The website aims to advance the automatic description of music by emphasising cultural specificity, carrying out research in music information processing with a domain knowledge approach. The project mainly focused on five music traditions of the world: Hindustani (North India), Carnatic (South India), Turkish-makam (Turkey), Arab-Andalusian (Maghreb), and Beijing Opera (China).

Indian Music Tonic Dataset

The dataset contains 597 commercially available audio music recordings of Indian art music, both Hindustani and Carnatic music, where each recording is manually annotated with the tonic of the lead artist. 

http://compmusic.upf.edu/iam-tonic-dataset
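A tonic annotation is typically useful as a reference point for expressing pitches as intervals. The sketch below assumes the tonic is available as a frequency in Hz (the annotation file format is not described here) and converts a pitch track to cents relative to that tonic using the standard formula 1200 × log2(f / tonic). The tonic value and pitch track are hypothetical, not taken from the dataset.

```python
import math


def hz_to_cents(frequency_hz, tonic_hz):
    """Return the interval between a pitch and the tonic, in cents."""
    if frequency_hz <= 0 or tonic_hz <= 0:
        raise ValueError("frequencies must be positive")
    return 1200.0 * math.log2(frequency_hz / tonic_hz)


# Hypothetical tonic annotation and pitch track, both in Hz.
tonic_hz = 146.83
pitch_track_hz = [146.83, 220.245, 293.66]

# Tonic-normalised pitch track, rounded to one decimal place.
cents = [round(hz_to_cents(frequency, tonic_hz), 1) for frequency in pitch_track_hz]
```

Normalising by the tonic in this way makes recordings by artists with different tonics directly comparable, since a perfect fifth above the tonic lands near 702 cents and the octave at 1200 cents regardless of the absolute pitch.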

Carnatic Varnam Dataset

The dataset has 28 solo vocal recordings made for research on the intonation analysis of Carnatic ragas. In addition to the audio recordings, the dataset contains time-aligned tala cycle annotations and swara notations in a machine-readable format.

http://compmusic.upf.edu/carnatic-varnam-dataset

Carnatic Music Rhythm Dataset

This dataset is a sub-collection of 176 excerpts in four taalas of Carnatic music, with audio, tala-related metadata and time-aligned markers that indicate the progress through the tala cycles.

http://compmusic.upf.edu/carnatic-rhythm-dataset
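To show how time-aligned cycle markers might be used, the sketch below takes a list of timestamps (in seconds) marking the start of each tala cycle and derives the cycle lengths and an average cycle rate. The marker values and the function names are hypothetical; the dataset's actual annotation format is not described here.

```python
def cycle_durations(cycle_start_times):
    """Return the length in seconds of each complete cycle.

    A cycle is complete when the start of the following cycle is also marked.
    """
    return [
        later - earlier
        for earlier, later in zip(cycle_start_times, cycle_start_times[1:])
    ]


def cycles_per_minute(cycle_start_times):
    """Return the average number of complete cycles per minute."""
    durations = cycle_durations(cycle_start_times)
    if not durations:
        raise ValueError("need at least two cycle markers")
    return 60.0 * len(durations) / sum(durations)


# Hypothetical cycle-start markers in seconds, not taken from the dataset.
cycle_starts = [0.5, 4.5, 8.5, 12.5]
durations = cycle_durations(cycle_starts)
rate = cycles_per_minute(cycle_starts)
```

The same approach would apply to any annotation that marks cycle boundaries, including the taal cycle markers in the Hindustani rhythm dataset.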

Hindustani Music Rhythm Dataset

The dataset is a sub-collection of 151 excerpts in four taals of Hindustani music, with audio, taal-related metadata and time-aligned markers that indicate the progress through the taal cycles.

http://compmusic.upf.edu/hindustani-rhythm-dataset

Mridangam Stroke Dataset

The dataset contains 7,162 audio examples of individual Mridangam strokes, covering ten different strokes played on Mridangams with six different tonic values.

http://compmusic.upf.edu/mridangam-stroke-dataset

Mridangam Tani-avarthanam Dataset

The dataset is a transcribed collection of two tani-avarthanams played by Mridangam maestro Padmavibhushan Umayalpuram K. Sivaraman. The audio was recorded at IIT Madras and annotated by Carnatic percussionists. The dataset contains 24 minutes of audio and 8,800 strokes.

http://compmusic.upf.edu/mridangam-tani-dataset

Saraga: Research datasets of Indian Art Music

The repository contains time-aligned melody, rhythm and structural annotations for two large open datasets of Indian Art Music (Carnatic and Hindustani music).

https://zenodo.org/record/4301737#.YTddzcaxX0o

Tabla Solo Dataset

The dataset is a transcribed collection of Tabla solo audio recordings spanning compositions from six different Gharanas of Tabla, played by Pt. Arvind Mulgaonkar. It consists of audio and time-aligned bol transcriptions.

http://compmusic.upf.edu/tabla-solo-dataset