Keunwoo Choi — GeistHaus

My recent talks

keunwoochoi Jan 6, 2022

Last month, I gave an invited talk at NYU. I somehow ended up talking and talking and talking for 100 minutes.

It’s best to watch it on YouTube since I put quite a few timestamps there.

This talk is based on my previous talk, which I gave in Korean in November 2021.

In case you’re interested in what we’re doing in SAMI at ByteDance/TikTok, here’s my 10-min intro video at ISMIR 2021.

Finally, this is a 4-min intro-ish video of my paper. ..and my singing

http://keunwoochoi.wordpress.com/?p=4091

In the middle of PhD programs: internship, academia vs industry, etc.

keunwoochoi Dec 1, 2021

Just my two cents to someone who emailed me – but the question is pretty general.

There are clear pros and cons in academia/industry. Some are about the work itself: how you spend your time over a day or a year, and what you will and won’t have learned after a few years. Work/life balance can vary a lot too, and so can practical aspects like salary, where you live, etc.

You should invest some time and effort to experience at least some part of industry, even indirectly. An internship can be the answer, as it’s supposed to be. There are many factors in planning one: when, how long, which companies will or won’t accept you, potential visa issues, which topic, etc. I’d say don’t try to find a perfect chance. Three months during a PhD, without any commitment and with some possibility of publishing something, is very cheap compared to what you’d have to invest just to “experience” industry after graduation, and you wouldn’t want to pay that price if you were to end up not liking it.

If it makes sense to do it in Europe, do it there. If the company is not a “top” kind of company or something, well.. I think it’s all good. If you’re going to be disappointed, so be it; you want to know that as early and as accurately as possible.

If it turns out that it’s hard (or not possible) to find the right company, that’s also crucial information.

Glad you’re visiting CNRS. Certainly no one or no environment is perfect, so you’d want to diversify those.

It’s becoming standard to interview research interns the way companies interview software engineering interns or full-time researchers. The expectations are lower, but I’m talking about the *way* they do it. They do it for good reasons, so overall it’s not a bad idea at all to learn that kind of stuff too: essentially, how to write good research code. ISMIR recently had a masterclass on this by Peter Sobot, if you can find it.

This is it. Hope it helps

http://keunwoochoi.wordpress.com/?p=4084

Music Classification: Beyond Supervised Learning, Towards Real-world Applications 📕

keunwoochoi Nov 16, 2021

https://music-classification.github.io/tutorial/

I wrote this book – Music Classification: Beyond Supervised Learning, Towards Real-world Applications with Minz Won and Janne Spijkervet. We used this in our ISMIR 2021 tutorial session and will keep updating the book.

When you google “music classification deep learning”, you get nothing but replications of the nice and brief blog posts. Those towardsdatascience-dot-com posts serve a purpose, but no deeper than taking part in Kaggle challenges using bite-sized datasets.

If anyone’s responsible for that, it would be people like me, because we were not filling the knowledge gap while more than a crowd of hobbyists were winning the game of search engine optimization. And..

I DECLARE THAT IT SHALL NOT BE LIKE THAT ANYMORE –

..because now we have this book!

Our book covers the basics to the recent advances and the tips and tricks from our hands-on experience. We added Jupyter notebooks so that you can run model training, evaluation, data augmentation, etc. The content ended up overflowing the original 3-hr slot. Hopefully it’s not overwhelming, as we focused on summarizing the modern history of music classification.

Read the book → https://music-classification.github.io/tutorial/

http://keunwoochoi.wordpress.com/?p=4070

My ISMIR 2021 Submission and its reviews

keunwoochoi Jul 15, 2021

As I did 2 years ago on my DrummerNet paper, I’m open-sourcing the reviews my submission received. I did it back then, and I’m doing it again now, because back when I had no paper at ISMIR, I was always very curious about how ISMIR reviewing is done. By not making this information available, I believe, we’d be causing some survivorship bias – only those who have submitted get some useful information.

So here we go. The submitted version of the paper is here. The title of the paper is “Listen, Read, and Identify: Multimodal Singing Language Identification of Music”.

Reviewer #1

I received many “Strongly Agree” ratings from this reviewer. I’ll only include some of the comments here.

11. Please explain your assessment of reusable insights in the paper.
This paper provide a reproducible work in singing language identification by using a public Music4All dataset. Preprocessing steps for audio and texts inputs are well presented based on open-source Python libraries. Model structure is also based on well-known deep learning architectures.

18. Main review and comments for the authors
This paper presented LRID-Net, a deep learning model for singing language identification that takes multimodal data, including an audio input and a text input which combines track title, album name, and artist name. A series of experiments were done using a public dataset. This helps the community reproduce the work. Although the audio and text branch reused model structures such as ResNet-50 and MLP, this paper presented detailed experiments to demonstrate the performance and properties of LRID-Net under different use cases of input modalities. Thos provides insights to apply SLID in real world scenario.

There are three places that could be more clarified by the authors.

1. As seen in Figure 1, audio drop was applied just after the audio signal. Given that the audio clips are 30-second and dropout rate was fixed to 0.2, I understand such audio drop is to “mute” 6 seconds in the clip. This may not cope with the real-world cases when audio content is partially missing. Why not apply audio drop after melspectrogram?

2. In 5.4.1 from line 378 to 382, the reasons for “the effects of modality dropout are better reflected on weighted-average scores” need to be rephrased.

3. The authors mentioned they cannot provide a satisfying explanation about some experiment results. If possible, please add any new explanations for camera-ready.

Here, I seem to have confused the reviewer a bit. The audio dropout does *not* mute 20% of the signal – it drops the whole audio signal with a probability of 20%. In other words, what Reviewer #1 is suggesting is already happening.
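
To make the difference concrete, here is a minimal sketch of that kind of modality dropout in plain Python. It is not the code from the paper: the function name `modality_dropout` is made up for illustration, and I’m assuming that “dropped” means the whole input is replaced with zeros. Unlike the usual dropout, the kept input is not rescaled by 1/(1−r).

```python
import random


def modality_dropout(audio, drop_rate=0.2, rng=random):
    """Drop the WHOLE audio input with probability `drop_rate`.

    Hypothetical helper for illustration. `audio` is a list of samples.
    Assumption: a dropped modality is represented by an all-zero signal.
    """
    if rng.random() < drop_rate:
        # The entire modality is dropped, not 20% of its samples.
        return [0.0] * len(audio)
    # Kept as-is: no 1/(1 - r) rescaling, unlike standard dropout.
    return list(audio)


rng = random.Random(0)  # seeded only to make the demo repeatable
clip = [0.1, -0.2, 0.3, 0.05]
outputs = [modality_dropout(clip, 0.2, rng) for _ in range(1000)]
n_dropped = sum(out == [0.0] * len(clip) for out in outputs)
print(f"dropped {n_dropped} of {len(outputs)} clips")
```

Every output is either the untouched clip or all zeros; nothing in between, which is the point the reviewer missed.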

Reviewer #2

18. Main review and comments for the authors
This paper proposes a solution for the task of detecting the lyrics language of a given song. It does so in a multimodal fashion, incorporating both raw audio data as well as textual data in the form of track metadata (artist, album and track names).

The proposed model is a relatively simple neural network architecture with separate branches for the audio and the textual features. The authors further propose to adopt the concept of modality dropouts to give their model the ability to handle missing features – e.g., still be able to classify a song based solely on audio data, if no textual metadata is available.

The paper is generally easy to read and understand. The authors stress reproducibility as one of their main contributions, and in my opinion, they reach that goal. They use an openly available dataset, and the information on their model architecture given in the paper seems sufficient to re-implement it, also owing to its relative simplicity. That being said, it would be great if the authors went one step further and made their implementation available as open source.

The experiments described by the authors seem sufficient for reaching their conclusions, and I can’t find any fault with their setup. One further potential limitation of their results is that the ground truth labels they use are also just estimates computed by langdetect (on lyrics data, as opposed to the metadata their model uses). Maybe this could/should be discussed?

As a note on the paper structure, I would suggest moving the Dataset section before the LRID-NET section. This would make some information in the LRID-NET section clearer when reading the paper from top to bottom.

Finally, here are some more detailed points:
– 135: What separator character is used for joining, if any?
– 163: What is the reason for the output size of 11? langdetect supports 55 languages, as you have mentioned before – did you decide to disregard some of those and reduce the set of languages to 11? Edit: I see this is explained later. Maybe shift the order around to explain this earlier. In general, I would suggest moving the Dataset section to before the LRID-NET section.
– 177: “One more difference of modality dropout is that there is no 1/(1−r) scaling when the input is not dropped.”
Maybe this could be explain in a bit more detail.
– 179: “During test time, a system with LRID-Net inputs an arbitrary zero vector to the model if a modality is missing.”
What is an “arbitrary zero vector”? Isn’t there only one zero vector (for every given size)?
– 277: No comma after “which”.
– 296: “German”
– 297: “Third, as summarized in Figure 3, in every metric and averaging strategy, TO model outperformed AO model”
But not in every language – do you have a potential explanation for this?
– 300: “[…] audio _is_ less […]”
– 313 and following: The use of + and – in parentheses is not consistent.
– 329: double “model”
– 358: Was there a particular reason for choosing a rate of 0.2?
– 379: “they”
– 444: I think you mean “eliminate”/”alleviate” the need?

Reviewer #2 gave me a good suggestion on clarifying that the ground truth is also coming from the langdetect applied to the lyrics. I’ll add this to the camera-ready version.

The reviewer also kindly fixed many grammar issues, which I appreciate a lot!

Reviewer #3

I got a Weak Accept here, but with nothing but nice and careful comments. Given the positive comments, though, I guess the reason for the weak accept is the limited expected impact.

10. The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community.
Disagree

18. Main review and comments for the authors
*** Novelty of the paper + Stimulation potential ***
> The major contribution of this study is its reproducibility in terms of using a publicly accessible dataset for SLID works.

*** Appropriateness of topic + Importance ***
> While SLID is reportedly not widely studied in the MIT literature, there could have been more implications for future MIR research and practices. For example, how do SLID techniques contribute to the user-centered MIR system design and evaluation?
> The term “multimodal” might give the impression to readers that other less typical sources of information have been taken into account, in addition to text and audio. It would be better to give a clear definition early in the paper.

*** Scholarly / Scientific quality ***
> Is there any literature support for justifying the three selected metadata elements (i.e., track title, album name, artist name)?
> It is excellent that the missing data scenario has been considered. Some songs would be in the form of singles from unreleased albums or simply independent singles not affiliated with any albums.
> Why are multilingual lyrics treated as negligible cases? They might reveal important information for the SLID analysis .In relation to the theme of ISMIR this year (“Cultural Diversity in MIR”), it might be common for songs from some countries to contain lyrics from more than one language — this might help identify the unique characteristics of music from particular cultures.
> It is appreciated that the study prevented the problem of having artist-dependent information confound the SLID analysis. This would allow more leeway in that an artist (e.g., from Canada) has songs from more than one language (e.g., English, French).

*** Reusable insights ***
> Given the rising markets of pop music in Greater China (i.e., China, Taiwan, Hong Kong), the absence of specific mentions of Chinese songs merits some interpretation from the authors.

*** Readability and paper organization ***
> Thanks for spelling out the three main contributions of this study (Lines 85 – 97) , which could have also been added to the Abstract too to build up readers’ momentum.

One thing: I still think it’s fine to assume multi-lingual songs are negligibly rare. My definition of the language of lyrics is not naively literal; e.g., I’m fine with calling a dominantly Korean song with some English words like “baby”, “oooh”, “yes”, “oh my god” a Korean song. That’s because of the context in which pop music is consumed, which is the (assumed) target scenario. That kind of song can be perceived as 100% Korean by any listener. An example of a “true” multi-lingual song would be this one.

And this kind of song is pretty rare.

Meta Reviewer

The meta review is also absolutely helpful. But on the reproducibility, the meta reviewer seems to have misunderstood a little.. which is kinda likely and understandable given the amount of work meta reviewers are given (and also, anyone can misunderstand anything). Let me explain it a bit.

18. (Initial) Main review and comments for the authors
The paper addresses the interesting problem of singing language identification (SLID). It proposes a machine learning model that combines textual metadata with audio features, and which is also capable of dealing with cases where some inputs are missing. Results of various experimental evaluations are reported.

The paper is generally well prepared and easy to follow. A number of language issues in particular in the later parts should however be fixed. The topic is relevant to the conference. The approach may have some novelty.

I am not entirely convinced by the work for a number of reasons.
* The authors claim to provide a reproducible work, but do not share the code they used in the experiments and they do not share the datasets that they used. Also, they do not report any model hyperparameters, and specifics regarding the data splitting procedure are missing.

I specified that I used ResNet50 with base_dim=64. This is actually enough information to 100% reproduce the model since ResNet50 is a proper noun, i.e., the name refers to one well-defined architecture.

Data splitting information is 100% transparent since I open-sourced the code and the result.

But all the other concerns are actually true!

* The authors do not report if their observed improvements or deteriorations are statistically significant. Some differences seem rather small.

* To me it was furthermore difficult to interpret the obtained precision and recall values on an absolute scale. Are these results good enough to be used in a practical application? The datasets is also highly imbalanced, which makes the interpretation even more difficult. I was also wondering if other metrics than F1/precision/recall could have been used for the multi-class classification problem.

* There are some unexpected observations for which the authors have no explanation. This is worrying as these observations could be the result of a technical error or a design error.

* From a technical perspective, the authors do not provide indications in which ways the audio model contains features that are suited to predict the singing language. Some background should be provided why we expect that such features exist.

This is it. Um… bye!

http://keunwoochoi.wordpress.com/?p=4048

slightly better research code – avoid hard-coded values

keunwoochoi Jun 22, 2021

Imagine you need to crop the first 10 seconds of a waveform.


src = src[:160000]  # sampling rate=16000

This can be improved like this.


src = src[:10 * 16000]

Of course it does the same thing. But this is better because..

Now you know the meaning of this magic number 160000.

And this means that..

Now ANYONE would know the meaning of 160000.

Because you just clarified that what you want to do is to get the first 10 seconds, which may or may not be 160000 samples.

Of course, most of the time, it’d be better to do this.

SR = 16000  # this is defined somewhere, or even in the same function.

# some other stuff

src = src[:10 * SR]

Why? Because..

- You might want to change SR later. You’ll make fewer mistakes if you do this.
- It’ll also make it faster to change the sampling rate.
- You can write test code using the same SR.
- The meaning of 16000 is even clearer.

The idea is to make things easier for you, and for some other people. Readability matters!
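
The same idea can be pushed one small step further by naming the duration too and wrapping the crop in a function. This is just a sketch: `crop_head` and `CROP_SECONDS` are hypothetical names I made up here, and the waveform is a plain list standing in for whatever array type you actually use.

```python
SR = 16000  # sampling rate in Hz, defined once
CROP_SECONDS = 10  # hypothetical name for the duration we want to keep


def crop_head(src, seconds, sr):
    """Return the first `seconds` seconds of `src`, sampled at `sr` Hz."""
    n_samples = int(seconds * sr)
    return src[:n_samples]


src = [0.0] * (12 * SR)  # 12 seconds of silence as a stand-in waveform
cropped = crop_head(src, CROP_SECONDS, SR)
print(len(cropped) / SR, "seconds")
```

Now a test can call `crop_head` with a tiny sampling rate (say, 10 Hz) instead of allocating 160000 samples, which is the “you can write test code using the same SR” point from the list above.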

http://keunwoochoi.wordpress.com/?p=4034

Tensorflow2 Keras – Custom loss function and metric classes for multi task learning

keunwoochoi Sep 29, 2020

It is well known that we can use a masking loss for missing-label data, which happens a lot in multi-task learning (example). But how about metrics? Without a similar modification, the keras.metrics classes and functions would still give you some numbers, but they won’t be quite accurate.

No worries though, they can be modified as in this gist. Or as below, although WordPress doesn’t seem to render gist correctly.

Tested on Tensorflow 2.3.

# masked_loss_metric.py
import tensorflow as tf
import tensorflow.keras.metrics as tfkm
from tensorflow.keras import backend as K

MASKED_VALUE = -1


def get_1d_mask(y_true, masked_value):
    """Get 1D mask by comparing y_true with masked_value.

    By using `reduce_any`, an item is kept if at least one of its labels
    differs from `masked_value`; it is masked out only when all of its
    labels are equal to `masked_value`.
    """
    mask_2d = K.not_equal(y_true, masked_value)  # (batch, n_class)
    return K.cast_to_floatx(tf.math.reduce_any(mask_2d, axis=1))  # (batch, )


# losses
def masked_binary_crossentropy(y_true, y_pred):
    mask = K.cast_to_floatx(K.not_equal(y_true, MASKED_VALUE))
    return K.binary_crossentropy(y_true * mask, y_pred * mask)


def masked_categorical_crossentropy(y_true, y_pred):
    mask = K.cast_to_floatx(K.not_equal(y_true, MASKED_VALUE))
    return K.categorical_crossentropy(y_true * mask, y_pred * mask)


# metrics – when there are parent classes
class MaskedRecall(tfkm.Recall):
    def update_state(self, y_true, y_pred, sample_weight=None):
        mask = get_1d_mask(y_true, MASKED_VALUE)
        return super().update_state(
            tf.boolean_mask(y_true, mask),
            tf.boolean_mask(y_pred, mask),
            sample_weight,
        )


class MaskedPrecision(tfkm.Precision):
    def update_state(self, y_true, y_pred, sample_weight=None):
        mask = get_1d_mask(y_true, MASKED_VALUE)
        return super().update_state(
            tf.boolean_mask(y_true, mask),
            tf.boolean_mask(y_pred, mask),
            sample_weight,
        )


class MaskedAUC(tfkm.AUC):
    def update_state(self, y_true, y_pred, sample_weight=None):
        mask = get_1d_mask(y_true, MASKED_VALUE)
        return super().update_state(
            tf.boolean_mask(y_true, mask),
            tf.boolean_mask(y_pred, mask),
            sample_weight,
        )


# Customized metric
class MaskedCategoricalAccuracy(tfkm.Metric):
    def __init__(self, name="masked_categorical_accuracy", **kwargs):
        super(MaskedCategoricalAccuracy, self).__init__(name=name, **kwargs)
        self.n_corrects = self.add_weight(name="n_corrects", initializer="zeros")
        self.n_items = self.add_weight(name="n_items", initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        # Note: this implementation ignores sample_weight
        mask = get_1d_mask(y_true, MASKED_VALUE)  # (batch, )
        y_true = tf.boolean_mask(y_true, mask)  # (n_items, n_class)
        y_pred = tf.boolean_mask(y_pred, mask)
        n_item = K.int_shape(y_true)[0]
        if n_item in (0, None):
            return
        if_correct = K.equal(K.argmax(y_true, axis=1), K.argmax(y_pred, axis=1))
        self.n_items.assign_add(K.cast_to_floatx(n_item))
        self.n_corrects.assign_add(K.sum(K.cast_to_floatx(if_correct)))

    def result(self):
        if self.n_items == 0.0:
            return 0.0
        return self.n_corrects / self.n_items

    def reset_states(self):
        self.n_corrects.assign(0.0)
        self.n_items.assign(0.0)
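
To see how far off an unmasked metric can be, here is a tiny framework-free sketch in plain Python. It is not part of the gist: `categorical_accuracy` and `argmax` are hypothetical helpers written only for this illustration, and the numbers are made up. Two of the four items have no label for this task (all `-1`), as would happen in multi-task training.

```python
MASKED_VALUE = -1


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def categorical_accuracy(y_true, y_pred, masked_value=None):
    """Accuracy over one-hot rows.

    If `masked_value` is given, rows whose labels ALL equal it are skipped,
    mirroring the 1D mask idea; otherwise every row is counted.
    """
    n_items, n_corrects = 0, 0
    for true_row, pred_row in zip(y_true, y_pred):
        if masked_value is not None and all(v == masked_value for v in true_row):
            continue  # no label for this item, so it should not count
        n_items += 1
        n_corrects += int(argmax(true_row) == argmax(pred_row))
    return n_corrects / n_items if n_items else 0.0


y_true = [[1, 0], [0, 1], [-1, -1], [-1, -1]]  # last two items are unlabeled
y_pred = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6]]

naive = categorical_accuracy(y_true, y_pred)
masked = categorical_accuracy(y_true, y_pred, MASKED_VALUE)
print(naive, masked)
```

Both labeled items are predicted correctly, so the masked accuracy is 1.0, while the naive version counts the two unlabeled items as mistakes and reports 0.5.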

http://keunwoochoi.wordpress.com/?p=4019

Kapre doc → kapre.readthedocs.io

keunwoochoi Sep 9, 2020

Recently, I put some effort into improving it. Now it supports TensorFlow 2.0.

Please enjoy! https://kapre.readthedocs.io/en/latest/

http://keunwoochoi.wordpress.com/?p=4014

Some choices I’ve made and why

keunwoochoi Jul 19, 2020

Only occasionally though, I’ve been asked those classic questions like “So how did you start your career?”, “What motivated you to start a PhD course?”, etc., and somehow I ended up promising that I’ll write a post about it. So, here we go. Disclaimer: I’ll be only straightforward, simple, and dumb.

Bachelor: EE

My tutor in high school was kind of my role model and he studied EE. But at the last moment, I applied for something else in engineering school because I was not confident about my exam score. Yeah I failed and spent another year studying. OK I finally got a great score, but I still wasn’t sure if I should go for EE. But my teacher (who studied and taught Chemistry) was so sure about EE (because the industry of Korea is very EE-centered) that he almost didn’t listen to me. So.. yeah, zero insight of mine.

Master: Applied Acoustics

I was completely lost during my first 2 years in college and was demotivated even before I tried to study hard. Let’s say I liked bass guitar a lot. During the second semester of my second year, I took this class named Acoustics which completely fascinated me. The professor was about to retire soon though, and given the mandatory military service etc., that left me no choice but to start my master’s on it right after graduation. So here we go.

3-years of working in a research company

So yeah, my supervisor retired one year after I finished my master’s. Well, I had to deal with my military service anyway. I could’ve tried something else, but I quite liked Acoustics and there weren’t a lot of places where I could keep working on it while doing alternative military service. So yeah.

Starting a PhD

When finishing my master’s, I explicitly decided I was not going to do a PhD. I liked the work I did for my master’s thesis (don’t read it) and tried to publish it (which I did, fortunately, and also don’t read it). I then wanted to publish more work (in music information retrieval/MIR), I guess because I wanted to prove myself. I tried. But there was no one I could ask for supervision in MIR. My papers were rejected from ICASSP – ISMIR – ICASSP – ISMIR in 2 years. I thought it was because I was not getting any supervision (i.e., I blamed my circumstances). Well, therefore, let me start a PhD!

PhD: Abroad

At that point, I had many friends who were doing their PhDs at universities in Korea. I didn’t like the downside of it. I also wanted to experience (lol sorry it might be longer than you think) living outside of Korea. I also wanted to work outside of Korea because there really were only a limited number of choices in Korea in my field, Acoustics, um ok which became my thing because I quite liked it & I had to do my military service & the professor was about to retire & I liked music & etc.. Anyway, these are why.

PhD: Music

Acoustics is very fun but many of my colleagues in Applied Acoustic Lab actually started it because they liked music – even the supervisor – rather than sound. I already knew MIR might be even smaller than applied acoustics/DSP, I knew MIR is definitely smaller than speech, I knew there was no job in MIR in Korea. I don’t know, if I were not to choose what I like, I should’ve gone to a dental school. So what’s the point of compromising at that point? (Actually, there were/are still good reasons to do so, but somehow I didn’t.)

PhD: Computer Science

In the company, I had to run some experiments with many, many speakers etc and it was so painful. I was not even good at it. In the same team, those who were working on audio codec seemed free from those troubles, and that made me want to migrate from the world of EE or (physical) sound to the digital world. They seemed less cumbersome to develop / run experiments / (objectively) evaluate. Just neat. At least relatively neat. Ok, I’m logging in..

PhD: Machine Learning

Some reviewers who rejected my papers said my work lacked machine learning practices (which was correct, helpful, and timely). I also realized I couldn’t really comprehend any paper in MIR because I didn’t know those things, concepts, symbols, just anything. I started studying machine learning about a year before I started my PhD.

PhD: Deep Learning

I was a bit frustrated to see features from computer vision beat audio features in some audio-related tasks. One of my friends, with whom I studied and worked for about 10 years, was also a big fan of things in computer vision. In the end, I was convinced that deep learning would probably work well in music because it was already working very well in vision.
Another friend of mine recommended taking the Coursera lecture by Hinton.
I spent my first day of PhD reading Sander’s post about applying ConvNet for audio-based recommendation.
I wanted to get prepared to quit music after my PhD if the industry prospect wouldn’t look good. To do that, I needed to learn something versatile.

My two cents

A lot of coincidences and luck have led me here.
Some of the choices were limited because I only considered the things I was interested in.
You’d better figure out what to do after what you’re doing now is done. As early as possible.
For me, momentum mattered a lot. Once you’ve invested your time, you want to utilize it. In other words, it’s difficult not to be dependent on where you’ve been. In other words, every choice matters.
But you might be different from me.

Hope it was helpful or amusing, at least.

http://keunwoochoi.wordpress.com/?p=4004

Q&A: How to transcribe rap songs

keunwoochoi Jun 9, 2020

… I want to understand what they are rapping about … I want to ask if it is indeed possible to transcribe rap songs? I have vocals extracted from the songs and tried to use Google speech2text API for it but the results look very random and bad. I am given the impression that transcribing songs in general to lyrics is far from achieving ok performance atm. Does this sound right to you?
(From an email I received)

From what I’ve heard, I agree – vocal extractor + automatic speech recognition is not a completely working solution. So, what’s the real issue and what can we do? When you’re asking a question like this, you don’t want an answer like “Make a rap song dataset and train a neural net with WaveNet and GAN and CTC loss end-to-end”. Oh wait.. yes that could work, too. But we probably want something simpler.

Obviously, we should google “singing voice transcription”. They must’ve discussed the issues they faced; although you’ll need to take the difference between singing and rapping into account, additionally. However, in this post, I’ll just describe my thought process without reading papers.

Problem definition

People are rapping over some music / beat and we have recordings of it. These are not production music that we can find on streaming services. We want to transcribe the lyrics. The language is English.

Vocal extractor

Sound source separation is working pretty well these days. Check out recent demos like Open-Unmix. (CAVEAT – training datasets for vocal source separation usually consist of singing voices, not rapping. Even combined with strong drum beats, I wouldn’t worry too much about the difference between singing and rapping (or more like, the difference in the patterns they have on waveform/spectrogram representations). It’s still worth keeping in mind.) OK, seems like it’s pretty good.

Speech recognition

I believe Google’s speech recognition API should work (nearly) state-of-the-art. That said, it’s not a bad idea to do some survey.. by Googling “google speech2text API performance“. Checked out the first link. Ok they don’t seem to provide any official performance, which can be annoying but also makes sense.

If speech recognition is the problem, why would it be? If it’s not an end-to-end model, an ASR (automatic speech recognition) model consists of two stages – an acoustic model (audio-to-phoneme) and a language model (phoneme-to-word). Even in an end-to-end model, those are what’s happening seamlessly and we could choose to see them separately.

Conclusion… NOT

WELL I DON’T KNOW

Ok let’s try further.

Break down the problems

We have a bunch of smaller questions to answer. I only have some ideas about how we can answer them.

Does vocal separator work well for rapping voices?
- Rapping voices are mixed loud enough, and the instrumentation is relatively sparse in Hip-hop. So yes, I think it would work well. This can be tested by running real examples through the model you use.
Does vocal separator work well for non-production recordings?
- This is hard to answer because I don’t know the recording quality. But probably yes, as long as the signal-to-noise (or voice-to-others) ratio is at least as large as in typical music, and as long as the ‘others’ sound similar to the typical non-vocal components in music.
Does vocal separator work well for hip-hop?
- Besides rapping vs singing, I don’t see any potential problem with Hip-hop as a target of vocal separator.
Does speech recognition work well for rapping?
- .. next section.

Transcribe rapping voices

On one hand, rapping is more similar to typical speaking than singing is. I.e., your use-case should be pretty compatible with the acoustic model in the ASR model.
On the other hand, there are some particular words and expressions in rapping that won’t show up in typical conversation and script reading (=training dataset). I.e., your use-case might not be compatible with the language model.
- There might be a way to adapt your model with the vocabulary you’re expecting to appear a lot in your test case.
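
If you want to put a number on “does it work well for rapping”, one option (my suggestion here, not something from the email) is to hand-transcribe a few short clips and compute the word error rate (WER) of the ASR output against them, before and after any vocabulary adaptation. A minimal sketch in plain Python; `word_error_rate` is a hypothetical helper, and the normalization (lowercasing, whitespace split) is a simplifying assumption:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1  # a reference word is missing in hyp
            insertion = curr[j - 1] + 1  # hyp has an extra word
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / max(len(ref), 1)


# Made-up strings, only to show the call; use your own transcripts.
reference = "one two three four"
hypothesis = "one too three four"
print(word_error_rate(reference, hypothesis))
```

If the errors cluster on slang and names rather than being spread everywhere, that points at the language model rather than the acoustic model.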

Conclusion

Let’s say there’s no conclusion here. Chew it, digest it, and find your way!

Some other thought

If it’s production music, I’d rather run audio fingerprinting and find the corresponding lyrics.

http://keunwoochoi.wordpress.com/?p=3995

ICASSP 2020 papers and summaries

keunwoochoi May 18, 2020

Let me reuse my tweets

https://t.co/ZABBXEDS1c "Improving Universal Sound Separation Using Sound Classification". Used a pre-trained net to extract an embedding that conditions a separation model. Nice work! Turned out it's the same first author (@ETzinis) of the paper above.

— Keunwoo Choi (@keunwoochoi) May 18, 2020

https://t.co/hgfCBMnRSU The structure of separate formant mask decoder and pitch skeleton decoder, the singer identity embedding is used to control two different aspects of the synthesized vocal – formant (timbre) and pitch skeleton (singing style). By @92HsChoi #icassp2020

— Keunwoo Choi (@keunwoochoi) May 18, 2020

https://t.co/pUDoAkfQ2F "PitchNet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network". Encoder is learned not to be able to perform singer/pitch tasks well so that its output would not have that info. Instead, they are fed to the decoder in separate paths. pic.twitter.com/AH3q6yMUo9

— Keunwoo Choi (@keunwoochoi) May 18, 2020

http://keunwoochoi.wordpress.com/?p=3986
