Introduction:

In this post, I'll be exploring an application of optical music recognition (OMR) to bagpipe notes (I'm also a professional bagpiper). OMR is the field where machine learning and computer vision methods are applied to identify and classify musical notation and symbols. It is closely related to the field of OCR (optical character recognition) which is widely used in many tech products today.

A unique facet of bagpipe music is that our scale is limited to nine notes. We have just over one octave, ranging from "low g" through "high a". This makes our task closely related to the classic MNIST task, where handwritten digits 0-9 are recognized. These nine notes are only a subset of piping notation, which also includes other symbols as well as technique and timing notation, but the scale is a good place to start.


The Dataset:

Advances in computer vision are due in part to the wide availability of high-quality datasets, such as ImageNet, that researchers can work with. Datasets for OMR are less plentiful, though a few exist. However, none meet the niche needs of bagpipe music, so I created a synthetic dataset instead.

To do this, I first typeset the scale in LilyPond and output it to PNG images. This resulted in 9 images: a quarter note for each note of the scale. Then I applied data augmentation to turn these samples into a full (albeit small) dataset. Data augmentation is a process whereby operations such as cropping, skewing, and other transformations and distortions are used to increase the number of image samples available for training a machine learning model. This is especially important for deep learning methods, which often need huge amounts of data, as well as in domains where collecting or labeling datasets is difficult or expensive. The final dataset had 509 samples, roughly 50-60 for each note. I also created a downscaled version of the dataset where each image's dimensions were reduced by a factor of 10. While there can be issues with fully synthetic datasets, this approach lets us get started without labeling a collection of scores.

The Bagpipe scale: High A through Low G

This is the scale I started with. To augment the images I used the Augmentor package and created a pipeline:

The pipeline applied the zoom, rotate, and random distortion operations and then sampled 500 images. Once complete, I had 500 augmented images along with the original 9. As an example of the images obtained, below are a few samples of the note E.
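As a rough sketch of what such a pipeline might look like with Augmentor: the function name `build_pipeline`, the source directory argument, and every probability and range below are assumptions for illustration, not the values used to build this dataset.

```python
def build_pipeline(source_dir, n_samples=500):
    """Build and run an Augmentor pipeline over the images in source_dir.

    All probabilities and ranges below are illustrative assumptions,
    not the settings used for the dataset in this post.
    """
    # Imported here so the function can be defined without Augmentor installed.
    import Augmentor

    p = Augmentor.Pipeline(source_dir)
    # Zoom in by a random factor between 1.1x and 1.5x, half of the time.
    p.zoom(probability=0.5, min_factor=1.1, max_factor=1.5)
    # Rotate by up to 10 degrees in either direction.
    p.rotate(probability=0.7, max_left_rotation=10, max_right_rotation=10)
    # Apply a random elastic distortion on a 4x4 grid.
    p.random_distortion(probability=0.5, grid_width=4, grid_height=4, magnitude=4)
    # Draw n_samples augmented images; by default Augmentor writes them
    # to an "output" folder inside source_dir.
    p.sample(n_samples)
    return p
```

Calling `build_pipeline` on the folder holding the nine typeset notes would then write 500 augmented images to disk.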

Classifying Notes:

The task here is to classify images of notes so that we can recognize them when they are found in a score. Many classifiers work by learning a decision boundary; in the simplest case this is a linearly separating hyperplane. Essentially, the question is "can you draw a line between two sections of the space such that misclassified examples are minimized?" Note that since we have more than two notes, our approach will use multi-class classification instead of binary classification. Conveniently, scikit-learn, the tool I'll be using, handles this directly.


To classify the images, I will use three algorithms: Decision Trees, Support Vector Machines, and Neural Networks. Decision trees are an eager learning method that recursively splits the data into smaller groups with fewer misclassified samples. This method has the benefit of high explainability because these splits create a tree which can be read as a series of boolean "and" rules when traversing from the root to a leaf node. Support Vector Machines are another effective classification method, which learns a decision boundary by maximizing the margin, or space, between the boundary and the training samples closest to it (the support vectors). Finally, neural networks are a huge field, with many types and architectures of networks, but here I use a simple multilayer perceptron. This type of neural network is a feedforward network with at least one hidden layer. Each layer is composed of a series of perceptron units along with their corresponding activation function. More complicated deep learning architectures such as Convolutional Neural Networks and ResNets have led to many advances and are incredibly effective for a number of problem domains.
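To illustrate the explainability point, here is a minimal sketch that prints a fitted tree as nested rules. It uses scikit-learn's built-in iris data as a stand-in (an assumption for illustration; it is not the bagpipe dataset), and the depth limit is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data: the iris dataset, not the bagpipe images.
iris = load_iris()

# A shallow tree keeps the printed rules short.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each root-to-leaf path reads as a chain of "and"-ed threshold tests
# that ends in a predicted class.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

For image data the features would be individual pixels, so the rules are less intuitive than here, but the structure is the same.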


Experiment:

Since our data consists of black-and-white images, the algorithms learn from a matrix of pixel values ranging from 0 to 255. Each row is a separate image sample and each column is a pixel. The notes, or class labels, have also been mapped to numeric values. For instance:
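As a minimal sketch of this layout, each image can be flattened into one row and each note name mapped to an integer. The names `NOTE_TO_LABEL` and `images_to_matrix`, and the label order, are hypothetical choices and not necessarily the mapping used for this dataset.

```python
import numpy as np

# Hypothetical label mapping: the order is an assumption for illustration.
NOTE_TO_LABEL = {
    "low_g": 0, "low_a": 1, "b": 2, "c": 3, "d": 4,
    "e": 5, "f": 6, "high_g": 7, "high_a": 8,
}


def images_to_matrix(images):
    """Flatten equally sized 2-D grayscale images into one row per image."""
    return np.stack([np.asarray(img).reshape(-1) for img in images])


# Two tiny 2x3 "images" with pixel values in 0..255.
images = [
    np.zeros((2, 3), dtype=np.uint8),
    np.full((2, 3), 255, dtype=np.uint8),
]
X = images_to_matrix(images)
y = np.array([NOTE_TO_LABEL["low_g"], NOTE_TO_LABEL["e"]])

print(X.shape)  # (2, 6): two samples, six pixel columns
print(y)        # [0 5]
```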

Finally, I use the implementations in scikit-learn, along with the default parameters. This is done by instantiating the model class, fitting it to the training data, and calling predict to get the label predictions. Then we can compare how well the model performed using accuracy as our metric.

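from sklearn.metrics import accuracy_score, classification_report
from sklearn.svm import SVC

# X_train, X_test, y_train, y_test: a held-out split of the pixel matrix and labels
clf = SVC()  # default parameters
clf.fit(X_train, y_train)

preds = clf.predict(X_test)
report = classification_report(y_test, preds)  # per-class precision and recall
print("Accuracy: ", accuracy_score(y_test, preds))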

>>> Accuracy: 0.22023809523809523

This is dismally low accuracy, and stems from the fact that SVMs are sensitive to the scale of the data they are trained on. Taking a look at a confusion matrix shows that the model classifies nearly all samples as 1. Ideally, we would see nearly all predictions along the main diagonal, where the predicted class (column j) equals the true class (row i), instead of in a single column.
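To make the single-column pattern concrete, here is a small sketch with made-up labels (not results from this dataset) in which a classifier predicts class 1 for almost everything.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels, made up for illustration: nearly every prediction is class 1.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 1, 1, 1, 2]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[0 2 0]
#  [0 2 0]
#  [0 1 1]]

# Correct predictions sit on the main diagonal, so accuracy = trace / total.
accuracy = np.trace(cm) / cm.sum()
print(accuracy)  # 0.5
```

Most of the counts pile up in column 1, which is the same pattern described above.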

To improve the SVM's accuracy, we can standardize the training data by removing the mean and scaling each feature to unit variance:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_scaled = sc.fit_transform(df.drop(columns=['class']))  # pixel columns only; the label is dropped

After scaling the data and re-running our SVM, the accuracy improves to 97.6%.
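One way to combine the two steps is a scikit-learn pipeline, which fits the scaler on the training split only and reuses it at prediction time. This is a sketch under assumptions: it uses scikit-learn's built-in 8x8 digit images as stand-in data, and the split size and random seed are arbitrary choices, so the score it prints is not the figure reported above.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: scikit-learn's 8x8 digit images, not the bagpipe notes.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y
)

# The pipeline learns the mean and variance from the training split only,
# then applies the same transformation before the SVM when predicting.
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)

score = model.score(X_test, y_test)
print("Accuracy:", score)
```

Keeping the scaler inside the pipeline avoids leaking statistics from the test split into training.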

For multilayer perceptrons, we follow the same approach and again see low performance to start (~9.5% accuracy). This can be resolved using PCA (principal component analysis) for dimensionality reduction, reducing the number of features from 40,893 (416 for the downscaled images) to 9. This leads to both improved performance and faster training, since there is far less data to handle, while retaining most of the information in the original data. Again, this is a preprocessing technique, applied as follows:

from sklearn.decomposition import PCA

pca = PCA(n_components=9)  # keep the 9 directions of greatest variance
X_pca = pca.fit_transform(df.drop(columns=['class']))

From left to right: the original confusion matrix, a plot of the new features, and the improved results (97% accuracy).
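The reduce-then-classify flow can be sketched end to end as below. The data is again scikit-learn's digit images as a stand-in, and the split, seed, and `max_iter` value are assumptions, so the numbers will differ from those in this post.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: 64 pixel features per image, not the bagpipe notes.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y
)

# Scale, project onto 9 principal components, then train the perceptron.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=9),
    MLPClassifier(max_iter=1000, random_state=0),
)
model.fit(X_train, y_train)

# Fraction of the (scaled) training variance kept by the 9 components.
retained = model.named_steps["pca"].explained_variance_ratio_.sum()
score = model.score(X_test, y_test)
print("Variance retained:", retained)
print("Accuracy:", score)
```

Checking `explained_variance_ratio_` is a quick way to see how much information the reduced features keep.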

Finally, decision trees don't suffer from either of these issues and have decent performance out of the box. The final results on the dataset that was both scaled and reduced were quite good, at roughly 99% or better. While this is likely due to the small, synthetic dataset used, it gives a good place to go from here. Many problems require additional hyperparameter tuning with techniques such as cross-validation and grid search.

Decision Trees:          0.9940476190476191
Multilayer Perceptrons:  0.9940476190476191
Support Vector Machines: 1.0
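
The cross-validation and grid search mentioned above were not needed here, but a minimal sketch of how they might look for the decision tree follows. The parameter grid is hypothetical and the data is scikit-learn's digit images as a stand-in, not the bagpipe dataset.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in data, not the bagpipe notes.
X, y = load_digits(return_X_y=True)

# Hypothetical search space; the values are illustrative only.
param_grid = {
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5],
}

# Try every combination, scoring each with 5-fold cross-validation.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

`best_score_` is the mean cross-validated accuracy of the best combination, which is a more conservative estimate than a single train/test split.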

Up Next:

This post was a start to applying machine learning for optical music recognition. In the next post I'll continue this with a dataset containing more bagpipe music notation.