After onset detection and pitch extraction, the information of each note in the query has been extracted. The next step is to transfer the information into an expression that is suitable for melody matching. In our system, not only the pitch but also the tempo information is used. Since the queries are from different people, the pitch and tempo information might be difference even they sing the same song. Generally speaking, the singing key of male is lower than female, and the speed of singing has a high probability lower or higher than reference in database. To solve this problem, we use time duration and pitch interval of notes.

在檢測起始點和提取音高後,查詢中每個音符的信息都已被提取出來。下一步是將這些信息轉換為適合旋律匹配的表達形式。在我們的系統中,不僅使用了音高信息,還使用了節奏信息。由於查詢來自不同的人,即使他們演唱同一首歌曲,音高和節奏信息也可能有所不同。一般來說,男性的演唱音調比女性低,而演唱速度很可能比資料庫中的參考速度快或慢。為了解決這個問題,我們使用音符的時間長度和音高間隔。

For pitch, we take difference of MIDI number to deal with key transposition. Note that the pitch information taken in the previous stage is in frequency and people who are amateur almost cannot sing very correctly. To deal with it, we change the fundamental frequency information into MIDI number with function (1.1). After that, instead of using rounding to get integer MIDI number sequence, we shift up and down by 0.1 each time to find out the best shift with least error. See Table 4.1 as an example. It is obvious that the first and the second notes have the same pitch, but using rounding would cause to pitch interval by 1, which is wrong. After finding the best MIDI sequence, we take difference as the feature for melody matching.

對於音高,我們使用 MIDI 編號的差異來處理音調轉換問題。需要注意的是,在前一階段獲取的音高信息是以頻率表示的,而且業餘人士幾乎無法非常準確地演唱。為了解決這個問題,我們使用函數(1.1)將基本頻率信息轉換為 MIDI 編號。之後,我們不採用四捨五入來獲取整數 MIDI 序列,而是每次上下調整 0.1,以找出誤差最小的最佳偏移量。請參見表 4.1 作為示例。顯然,第一個和第二個音符的音高相同,但使用四捨五入會導致音高間隔為 1,這是錯誤的。在找到最佳 MIDI 序列後,我們將差異作為旋律匹配的特徵。

image.png

For tempo feature, we use inter-onsets-intervals (IOIs). IOIs is the time duration of consecutive notes which is often used as a feature for melody matching. As pitch interval, IOI is transferred into IOI ratio to deal with the problem that every person sings in different speed. It is more robust to singing speed. However, usually people sing a song with better precise in pitch than tempo. For this reason, instead of using IOI ratio directly in our melody matching system, we convert the IOI ratio into three class by

對於節奏特徵,我們使用音符間隔時間(IOIs)。IOIs 是連續音符之間的時間持續長度,經常作為旋律匹配的特徵使用。與音高間隔類似,IOI 被轉換為 IOI 比率,以解決每個人以不同速度歌唱的問題。這對歌唱速度的變化更具穩健性。然而,通常人們在音高上的演唱精確度要優於節奏。因此,在我們的旋律匹配系統中,我們不直接使用 IOI 比率,而是將 IOI 比率轉換為三個類別

image.png

In our matching system, we use both the pitch interval and mapping beat as the feature. We have introduced several melody matching methods in Chapter 3. Because the number of songs in database in real world is a very large number, the efficiency is very important. We propose a method based on hidden Markov model and dynamic programming using progressive filtering with HMM as the first stage. The system block of melody matching is shown in Fig. 4.13. Although the HMM of each song in database can be constructed offline and the speed of matching is fast, the accuracy is not very good. The DP is more effective but need more computation time. Considering the system speed, we use our modified HMM as the first stage to filter out impossible songs in database. Then using modified DP for advanced matching.

在我們的匹配系統中,我們同時使用音高間隔和映射節拍作為特徵。我們在第三章已介紹了幾種旋律匹配方法。由於現實世界中資料庫中的歌曲數量非常龐大,效率變得極為重要。我們提出一種基於隱藏馬可夫模型和動態規劃的方法,使用HMM作為第一階段的漸進式過濾。旋律匹配的系統架構如圖4.13所示。雖然資料庫中每首歌曲的HMM可以離線構建且匹配速度較快,但準確性並不是很高。動態規劃更有效但需要更多計算時間。考慮系統速度,我們使用改良的HMM作為第一階段過濾掉資料庫中不可能的歌曲,然後使用改良的動態規劃進行進階匹配。

1. Proposed Hidden Markov Model

We have introduced hidden Markov model in section 3.2, and we have known that although the HMM is very efficient, the accuracy is not good. It is not robust to insertion or deletion and the error of pitch. Hence, this method is used to sift the targets that are impossible to match the humming signal.

我們已經在 3.2 節介紹了隱藏式馬可夫模型,並且了解到雖然 HMM 效率很高,但準確度不佳。它對插入或刪除以及音高誤差的穩健性不足。因此,這種方法被用來篩選出不可能與哼唱信號匹配的目標。

In order to improve the performance of the HMM, we consider that the pitch sung might be one or two semitone higher or lower. Give the first two sentence of the child song Twinkle, Twinkle, Little Star as example, the MIDI number is {60, 60, 67, 67, 69, 69, 67, 65, 65, 64, 64, 62, 62, 60}, and the pitch interval in semitone is {0, 7, 0, 2, 0, -2, -2, 0, -1, 0,-2, 0, -2}. The model after normalization can be built as in Table 4.2 as we only show to the third digit after decimal point. Here we give the cases not happen with probability P_m = 0.001. If someone sings the first sentence as {60, 60, 66, 66, 69, 69, 67}, we can still realize what the song he sings although there is a little different. However, if we apply the conventional HMM in Table 4.2, the probability that the song he sings is Twinkle, Twinkle, Little Star would become very small since the pitch interval he sings is {0, 6, 0, 3, 0, -2} and the probability of H(0, 6) and H(0, 3) is very low.

為了提高隱馬可夫模型(HMM)的性能,我們考慮到所唱的音調可能會高或低一到兩個半音。以兒童歌曲《閃亮,閃亮,小星星》的前兩句為例,對應的MIDI音符為 {60, 60, 67, 67, 69, 69, 67, 65, 65, 64, 64, 62, 62, 60},而音高間隔(以半音計)為 {0, 7, 0, 2, 0, -2, -2, 0, -1, 0, -2, 0, -2}。經過正規化後的模型可以如表4.2所示,我們僅顯示小數點後第三位。這裡我們給出不發生的情況,其概率為 P_m = 0.001。如果有人將第一句唱作 {60, 60, 66, 66, 69, 69, 67},儘管有些許不同,我們仍然能夠辨識出他所唱的歌曲。然而,如果我們應用表4.2中的傳統HMM,他所唱的《閃亮,閃亮,小星星》的概率將變得非常小,因為他所唱的音高間隔為 {0, 6, 0, 3, 0, -2},而 H(0, 6) 和 H(0, 3) 的概率都非常低。

image.png

Owing to the fact mentioned above, the HMM cannot be established by only the correct pitch interval of targets. We propose the diffused HMM to conquer this problem. Before adding a small probability into those transitions not happen, we add the diffusion probability matrix in Fig. 4.14 into the HMM model for each transition happens in target. The diffusion matrix should satisfy

由於上述原因,僅依賴目標的正確音高間隔無法建立隱馬可夫模型(HMM)。為了解決這個問題,我們提出了擴散HMM。在將小概率添加到那些不會發生的轉移之前,我們將圖4.14中的擴散概率矩陣添加到每個目標發生的轉移中。擴散矩陣應滿足以下條件:

image.png

For example, the first transition is from 0 to 7. While the observed transition matrix is shown as Table 4.3, we add the diffusion matrix into the yellow region. After all the transitions done the procedure, give a small probability P_m=0.001 to those case still with probability 0. In this case, the HMM matrix is changed into Table 4.4. Note that the probability in H(0,3) and H(0,6) is obviously larger than original. In this case, the probability that one sings this song would become more reasonable.