# 科技部補助專題研究計畫成果報告

# 期末報告

# 符合ANSI S1.11規範之低群延遲、低複雜度與高規格的助聽輔 具濾波器組演算法設計與其晶片實現(II)

- 計 畫 類 別 : 個別型計畫
- 計畫編號: MOST 104-2221-E-343-003-
- 執行期間: 104年08月01日至105年10月31日
- 執行單位: 南華大學資訊工程學系
- 計畫主持人: 賴信志
- 共同主持人:郭明雯、羅敦信
- 計畫參與人員: 大專生-兼任助理人員:張育軒 大專生-兼任助理人員:張翊緯 大專生-兼任助理人員:許庭豪 大專生-兼任助理人員:洪得軒 大專生-兼任助理人員:梁馨儀 大專生-兼任助理人員:劉雅筑 大專生-兼任助理人員:劉雅筑 大專生-兼任助理人員:邱涵榆 博士班研究生-兼任助理人員:莊文河

報告附件:出席國際學術會議心得報告

# 中華民國 106 年 02 月 06 日

中 文 摘 要 : 本研究提出以離散餘弦轉換與離散傳立葉轉換為基礎之核心架構 , 來實現符合 ANSI S1.11 所規範的1/3 八度音分頻方式中的近似Class-2、 Class-2、以及最高規格 Class-0。所提出之技術主要目的為了改善各頻帶間的不匹配現象 , 亦希望能獲得較低的 群延遲與較低的計算複雜度,以及能有效降低聽力圖上的匹配誤差 。本研究所提出之符 合Class-2 規格之濾波器組與2010 年Kuo et al.所提出之同等 级的平行架構濾波器組設 計方式比較:每個樣本點可以有效降低94.6%的乘法運算量,也降低 80.8%的加法運算 量,而整體群延遲僅有17ms 比Kuo et al.的28ms 要來的少。 本研究亦嘗試提出符合 Class-0 規格之濾波器組,目前嘗試發展之眼算法每個樣本點僅需 145 個乘法運算量及 409個加法運算量。結果初步顯示未來若能有效地採用本研究所提出 之概念,將有助於 高規格助聽器的設計與發展。 中文關鍵詞: ANSI S1.11, 助聽器, 濾波器組 英文摘要: This research mainly proposes guasi-class-2, class-2, and class-0 ANSI S1.11 1/3-octave filter bank design based on DCT and DFT modulation with the advantages of shorter group delay, lower computational complexity, and lower

matching error of the audiogram. Compared with Kuo et al.' s parallel class-2 ANSI S1.11 design, the proposed class-2 filterbank method totally has 94.6% reductions of multiplications per sample, 80.8% reductions of additions per sample, and 17-ms group delay only (less than Kuo et al.' s 28-ms group delay). For class-0 ANSI S1.11 filter bank. the proposed method would take 145 multiplications per sample, and 409 additions per sample. These results show that this research would bring more design concepts and application on the filterbank design topic of hearing aids, and the proposed method would be more suitable for high-level applications of hearing aids in the future.

英文關鍵詞: ANSI S1.11, hearing aid, filterbank

# 科技部補助專題研究計畫成果報告 (期末報告)

符合 ANSI S1.11 規範之低群延遲、低複雜度與高規格的助聽輔 具濾波器組演算法設計與其晶片實現(II)

計畫類別:個別型計畫 計畫編號:MOST 104-2221-E-343-003-執行期間:104年8月1日至106年10月31日 執行機構及系所:南華大學 資訊工程學系

計畫主持人: 賴信志

共同主持人: 郭明雯、羅敦信

計畫參與人員: 莊文河、張育軒、張翊緯、

許庭豪、梁馨儀、洪得軒、

吴佩珊、邱涵榆、劉雅筑。

本計畫除繳交成果報告外,另含下列出國報告,共\_2\_份: ■出席國際學術會議心得報告

期末報告處理方式:

1. 公開方式:

□非列管計畫亦不具下列情形,立即公開查詢

■涉及專利或其他智慧財產權,□一年■二年後可公開查詢

- 2.「本研究」是否已有嚴重損及公共利益之發現:■否 □是
- 「本報告」是否建議提供政府單位施政參考 ■否 □是, \_\_\_(請列 舉提供之單位;本部不經審議,依勾選逕予轉送)

中華民國106年1月31日

1

摘要—本計畫欲達成之目標共設定以下三種規格: (1). Quasi-ANSI S1.11 class-2 DCT-based 濾波器組演算法設計(最低複雜度、最低群延遲); (2). ANSI S1.11 class-2 DCT-based濾波器組演 算法設計; (3). ANSI S1.11 class-0 FFT-based濾波器組演算法設計。除此之外,本計畫亦針對未 來所需要之適應性噪音抑制所需之核心DFT based NLMS演算法開發(4).Sliding DFT快速演算 法。

# I. 簡介

醫師診斷與確認聽損的情況主要透過聽力檢測的測試,即:(Test of Pure Tone Audiogram, PTA),並且利用分布於250Hz到8kHz之間的八度音弦波刺激信號來量測病患的聽損臨界值。 ANSI S1.11 [1] 規範了class-0、class-1以及class-2的八度音頻帶所需的最小與最大的衰減限制。 針對音頻的補償,三分之一八度音廣泛用來結合NAL1 [2] and NAL2 [3]等補償策略以獲得較好 的效果。Nielsen等人 [4] 提出了一種以7個頻帶的IFIR濾波器組,並且透過非同步的技術與低 功耗的方式來實現其電路。Li等人 [5] 則提出了一個DFT為基礎的濾波器組設計來減少實現上 所需的大量複雜度。Chong 等人 [6] 則選擇設計一個critical-like且為16-channel與non-uniform 的濾波器組來匹配人類的聽覺;然而,這樣的一個系統並沒能符合ANSI S1.11的規範。而Kuo 等人 [7] 近來則提出一種更為有效且為18個頻帶的ANSI S1.11濾波器組,並藉由multi-rate演算 法的概念來實現,然而付出的卻是group delay的增加。因此,也導致了auditory與visual cues的 不匹配現象。也由於當group delay大於50ms時,人類的耳朵能夠輕易地察覺到不同,這樣的group delay的影響又被稱為"Hass Effect" [9, 10]。另一方面,Agnew's [11] 則建議group delay的影響應 該低於 20ms以內,以便避免視覺與語音不同步所造成的干擾情況。

為了縮短group delay,兩種以近似ANSI S1.11三分之一八度音的濾波器組設計 [12,13] 被陸續提出。透過Liu等人 [13] 的結果顯示,該設計在計算上仍需要大量的乘法、加法,以及係數需求。這也意謂著再計算複雜度與group delay的cost上,我們很難達到一個設計平衡。在他們的研究中,主要是透過換取計算上的增加一些負擔來降低group delay。近來,也有相關的研究如Wei等人 [14, 15, 16] 針對低複雜度 non-uniform的濾波器架構進行設計,而根據[16] 其matching error與group delay分析的結果仍高於其他設計,儘管該設計提出了一個可組態化的濾波器組設計架構是一大創新。另也有一些研究針對FFT/DCT polyphase (or polyphase-like)的架構進行設計,如 [17-19]。不同於前面的設計方案,這樣的設計可以直接產生M-channel sub-banks,而在計算上仍需要一些額外的負擔。

如上所述,若考量符合ANSI規範,我們將本計畫相關設計主軸分成近似class-2、class-2, 以及class-0的濾波器組設計。由於class-2已於先期計畫提出,而近似class-2屬於前期計畫與本計 畫初期進行,故本計畫內容簡要討論以DCT為核心基礎的近似class-2、class-2核心設計重點, 並將最終設計導向class-0濾波器組的解決方案。

### II. 提出之18-BAND QUASI-CLASS-2濾波器組設計

# A. Quasi-class-2 / class-2 specification and multi-rate concept

ANSI S1.11 [1] 規範了43個三分之一八度音的頻帶,其頻寬範圍從0Hz到20kHz間。第n個

頻帶的中心頻率 fm 定義如公式(1),其中 f<sub>ref</sub> 為參考頻率,定義為1 kHz。 每一個band對應的 頻寬大小 Δf 則定義如公式(2)。圖1為近似ANSI S1.11 class-2 三分之一八度音的濾波器規範, 其中綠線與紅線分別為其上限與下限。

$$f_m(n) = 2^{\left(\frac{n-30}{3}\right)} \times f_{ref}$$
(1)  
$$\Delta f = f_1(n) - f_2(n), f_1(n) = f_m(n) \times 2^{-(1+k)/6},$$
$$f_2(n) = f_m(n) \times 2^{(1+k)/6}.$$
(2)



**1**. Design specification of quasi-class2 ANSI S1.11 filterbank.

在本研究中,我們採用multi-rate的演算法來設計濾波器組,如圖2所示。為了達到有效的設計,我們僅使用two-stage的multi-rate架構,透過重複利用同一組9個頻帶的方式,來實現一個具有18個(9x2=18)頻帶的濾波器組,該設計方法可適用於class-2與近似class-2。由於group delay與 computational complexity主要受到濾波器tap數目的影響,因此如何減少每一組濾波器的長度為設計的一個考量點。然而,若只專注於減低濾波器的tap數目,則會造成其頻率響應變差,因此 從概念上需要反覆不斷的檢驗所設計出的濾波器是否會造成不符合規格。從概念上,我們可以透過公式(3)來計算group delay的影響,其中#L、 #D、以及#I分別為low-pass filter、decimation filter、以及interpolation filter的個別tap數目。

Group Delay = 
$$\left(\left\lfloor \frac{\#D}{2} \right\rfloor + \left\lfloor \frac{\#L}{2} \times \frac{1}{8} \right\rfloor\right) \times 8 + \left\lfloor \frac{\#I}{2} \right\rfloor$$
 (3)

我們所提出的設計中,一個以non-uniform與recursive-cosine-modulation 技巧為設計基礎的 9-band濾波器組主要透過結合mult-irate algorithm來實現,如圖2所示,其中(D)表示decimation filter、(I)表示interpolation filter,分別都被採用來避免frequency bands之間所造成的aliasing。



B 2. The proposed architecture of 18-band filterbank design.

# B. Proposed non-uniform 9-band filterbank design

為了符合non-uniform ANSI濾波器的設計規範,我們首先設計一個uniform 9-band的濾波器 組。公式(4)定義了一個以cosine-modulation為基礎的濾波器,其中h(k)為一個低通濾波器、L為 濾波器的tap數目、M為cosine modulation的轉換長度、i為調變後之頻譜索引位址、 $k_0$  為線性相 位的控制因子。圖3顯示了cosine-modulation的頻率響應圖,圖4則為其對應的uniform濾波器組 設計。藉由取代所有的 $z^{-1}$ 為全通濾波器A(z),如公式(5),我們可以獲得一個non-uniform的濾波 器組轉移函數,相關推導如公式(6)與(7),值得注意的是其中的"a"值可作為決定non-uniform的 濾波器組頻寬大小,如圖5所示。

$$y_{i}(n) = \sum_{k=0}^{L-1} x(n-k) \cdot h(k) \cdot 2 \cdot \cos\left(\frac{\pi}{M} \cdot i \cdot (k-k_{0})\right),$$
where  $k_{0} = \begin{cases} (M-1)/2 & , L = qM; q \in N. \\ \frac{(\text{mod}(L, M) - 1)}{2} & , L = qM + p; q \in N; (4) \\ 1 \le p \le M - 1. \end{cases}$ 

$$A(z) = \frac{z^{-1} - a}{1 - a \cdot z^{-1}} \tag{5}$$

$$\bar{H}_{i}(z) = \frac{T_{i}(z)}{X(z)} = \sum_{k=0}^{\infty} A(z)^{k} \cdot h(k) \cdot 2 \cdot \cos\left(\frac{\pi}{M} \cdot i \cdot (k-k_{0})\right)$$
(6)  
$$\bar{H}_{i}(e^{j\varphi(\Omega)}) = \sum_{k=0}^{L-1} (e^{-j\varphi(\Omega)})^{k} \cdot h(k) \cdot 2 \cdot \cos\left(\frac{\pi}{M} \cdot i \cdot (k-k_{0})\right)$$
$$, where \ \varphi(\Omega) = 2 \cdot \arctan\left(\frac{1+a}{1-a} \tan\left(\frac{\Omega}{2}\right)\right).$$
(7)



圖 3. Magnitude frequency responses of the cosine-modulated band-pass filters.



**B4**. The design concept of 9-band uniform filterbank structure.



**B**5. An example of magnitude frequency responses of 6-band filterbank with 5 different values of "a".

# C. Key component design and matching error calibration technology

我們所提出的濾波器組設計在關鍵參數的設計上可總結如下:1) 全通濾波器所需之參數"a" 值得決定,使其滿足quansi-class2 ANSI S1.1與class2 ANSI S1.1,並確實存在有效區間。2)離型 化低通濾波器的設計使其經過DCT調變後能滿足ANSI規範所定義之區間。3)計算出最少所需的 DCT轉換長度,以利做最佳的轉換。圖6為整個核心設計的演算流程。相關具體之成果已發表 於[20]



**B**6. Key component design flow for the proposed filterbank.

# D. Non-uniform class-0 FFT-based Filterbank Design

本方式將採用以快速傅立葉轉換為基礎的濾波器組(FFT-based Filterbank)的架構[21],將時 域訊號轉換成頻譜資訊得到其能量,透過頻譜上直接抑制頻率點(Frequency Bin)的衰減程度來 達成時域的濾波效果[22]。FFT-based Filterbank的轉換長度(Transform Length)影響其頻率解析度 (Frequency Resolution),而頻率解析度的好壞又會直接影響濾波效能。一般而言,FFT-based Filterbank濾波效果較不會有明顯的鄰近子頻帶交疊情形,同時也能避免多速率濾波器組繁雜的 設計過程。除此之外,使用Weight Overlap-add (WOLA)[23]架構合成訊號能有效的分析與重建 信號,這是因為FFT/IFFT和WOLA皆為線性系統,各個子頻帶的群延遲都相同,合成後之訊號 能避免多速率濾波器組各個頻帶群延遲影響,其中WOLA中所使用的窗函數是為了減少傳立葉 轉換所造成的頻譜洩漏問題,是故,WOLA架構之分析窗函數與合成窗函數的選擇為此技術之 關鍵點。

◆ 頻譜洩漏的影響

FFT 的轉換長度大小會影響每個頻率點代表之頻率解析度,假設在相同取樣頻率 (Sampling Rate)下,當FFT 轉換長度越大則頻率解析度越佳,這意謂著每個頻率點能量越 集中。文獻[11]指出FFT 的轉換長度越大時,有較佳的濾波效果如圖7所示,其中該文獻 之M 值為傅立葉轉換長度,α參數為 Roll-off Factor 會影響過渡頻帶(Transition Band)的頻 寬大小。當α值越小時轉移頻帶的頻寬越小,隨著α增加,轉移頻帶之頻寬會隨之增加。



**圖 7 FFT** 轉換程度與其頻譜衰減程度的關係

◆ 頻率解析度的選定

為了設計符合 ANSI S1.11 class1 規格濾波器組,我們必須先瞭解規格限制的各個子頻 帶之頻寬範圍。以最高頻帶(第 39 頻帶)的頻寬範圍 7127Hz~8979Hz 為例,基於奈奎斯取 樣定理(Nyquist Sampling Theorem),取樣頻率必須大於兩倍以上的最高頻率 8979Hz,也就 是大約 18KHz 以上才不會有頻帶混疊(Aliasing)產生。再來我們討論所需的傅立葉轉換長 度與頻率解析能力,我們知道頻率解析度定義如(8)式,其單位為 Hz/sample、Fs 為取樣頻 率、N為傅立葉轉換長度。一般而言,一帶通濾波器(Band-pass Filter)有 4 個關鍵頻率點需 要設計,即 2 個截止頻率和 2 個通過頻率,以頻寬最窄且最難以設計的頻帶(第 22 頻帶) 當作頻率解析度之設計標準,其頻帶範圍為 140.3Hz~176.8Hz,中心頻率約為 160Hz,總 頻寬約為 40Hz,因此,總頻寬除四個頻率點為 10(Hz/sample),即頻率解析度至少要 10(Hz/sample)才能夠符合 ANSI S1.11 第 22 頻帶的設計要求,因此本研究將使用的取樣頻 率為 20.48KHz,即傅立葉轉換長度為 2048 點。

# Frequency Resolution = Fs / N (9)

# ◆ 窗函數(Window function)的選擇

概念上,當窗函數之旁辦衰減程度越大時,能使得傳立葉頻譜能量集中,故濾波效果 越佳[21]。在窗函數之疊加比例方面,由於我們將採 20.48KHz 作為其取樣頻率,因此頻率 解析度為 10(Hz/sample),又因傳立葉轉換長度設定為 2048 點,為 2 的冪次方,故疊加比 例以 2 的整數倍做重疊運算(WOLA),疊加比例可分別為 50%(M=2)、75%(M=4)、 87.5%(M=8)。表 2 為本計畫評估之各窗函數的適當疊加比例與其頻譜能量反映。各個窗函 數使用 WOLA 架構所搭配之疊加比例和旁瓣衰減程度亦如表所示。分析結果顯示,Hanning 窗函數的頻譜能量集中,並且適當之疊加較低,濾波效果佳及運算複雜度較低,適用於用 來設計 ANSI S1.11 class1 的濾波器組。

綜合上述,相關設計之瓶頸點值得未來進行更進一步的探討與分析。而以快速傳立葉轉換為基礎的濾波器組(FFT-based Filterbank)來發展 ANSI S1.11 class0 的濾波器組設計,相關系統架構設計流程圖如圖 8(a)所示。而針對 18 個頻帶 N 點的傳立葉運算(FFT),不同於過去的設計在實現上不需要考量每個頻帶間的 group delay 影響,這是因為 FFT 的整體群延遲為固定值。為了避免整體延遲造成的邊界不連續,我們採用 hanning window 來達到消除 crack noise。再透過 post-window 來做 smoothing 重建的信號,我們需要再進行 constant-multiplication 來振福大小的

調整。圖 8(b)為我們提出的架構設計,可由資料緩衝記憶體、係數記憶體、控制器、與FFT 濾 波器組來組成,其中,FFT 的運算採用著名的 radix-2 架構來進行 2048 點的轉換設計。然而, 該根據設計出的電路,整體群延遲約為 100ms,雖然可符合 class-0 的規範,但是在實際應用上 可能無法應用於實際的即時交談情境,即便如此,仍可用線上影片觀賞的即時撥放系統中。相 關實現結果如表1所示,晶片佈局圖與其整體功率消耗分佈如圖9所示。



圖 8(a) 本計畫所提出之以快速傅立葉轉換為基礎的 ANSI S1.11 class0 濾波器組設計流程



圖 8(b) 本計畫所提出之以快速傳立葉轉換為基礎的硬體架構設計

表 1 REALIZATION RESULTS OF THE PROPOSED ANALYSIS AND SYNTHESIS FFT-BASED FILTERBANKS

|                                   | TSMC Technology          | 1P6M 0.18 um                         |
|-----------------------------------|--------------------------|--------------------------------------|
| Maximum and Real-Time Clock Rates |                          | 17MHz & 3.32 MHz                     |
|                                   | Data Buffer (SRAM)       | 4x1024x32                            |
| Memory                            | Window Fun. (ROM)        | 1024x16                              |
|                                   | FFT Coeff. (ROM)         | 512x2x16                             |
| Core                              | Area & Power Consumption | $1.96 \text{mm}^2 \& 3.01 \text{mW}$ |



**9** Silicon implementation and gate count distribution

| 表.2 本研究提出之 class0 Filter bank 所使用的各窗函數與其適當專 | 量加比例 |
|---------------------------------------------|------|
|---------------------------------------------|------|

| Window       | Rectangular | Hamming | Hanning | Blackman-Harris | Chebyshev |  |
|--------------|-------------|---------|---------|-----------------|-----------|--|
| % of         | 0%          | 75%     | 75%     | 87.5%           | 87.5%     |  |
| Overlap-add  | (M=1)       | (M=4)   | (M=4)   | (M=8)           | (M=8)     |  |
| Spectrum     | Deer        | Esia    | Cood    | Cood            | Dest      |  |
| Energy       | ergy        |         | Good    | Good            | Dest      |  |
| Spectral     |             |         |         |                 |           |  |
| Leakage PNSR | 16.7        | 34.1    | 53.5    | 62.1            | 67.7      |  |
| (dB)         |             |         |         |                 |           |  |

及其頻譜能量反映結果

◆ Matching error 的比較與結果

Wei等人所提出的研究[14-16]係根據<u>www.earinfo.com</u>網站所提供的audiogram進行各種的濾波器組測試,因此,本計畫也採取同樣的方式進行模擬。於此,我們採用Wei等人最新發表[16]所使用的兩個Audiogram Examples進行比較,聽力檢測圖的匹配結果表3所示。 表3統計的結果呈現,我們所提出的濾波器組的頻率響應與audiogram所繪的曲線的相減後 具有較低的matching error。由於FFT-based濾波器組的頻率解析度為10(Hz/sample),意味著可依據每個頻率點(Frequency Bin)做最佳化的聽力匹配,其匹配流程圖如圖10所示。

一開始我們先設定欲匹配之聽力紀錄單(Audiogram Example A~B),再來設定可接受之 最大匹配誤差值,找出每個頻率點所需之增益大小並合成整個頻帶,計算FFT-based濾波器 組合成的頻譜與欲匹配的聽力紀錄單之最大誤差值,判斷該值是否小於可接受之最大匹配 誤差值。如誤差值小於可接受之最大匹配誤差值,則完成整個聽力匹配的動作;否則,將 依情況是否調整可接受之最大匹配誤差值,如果不調整,則須先找出造成最大誤差的頻率 點,調整該頻率點之增益大小,依循此設計流程完成整個匹配動作。經過逐點最佳化之匹 配流程動,我們所提出的FFT-based濾波器組可以獲得最低的匹配誤差。

|                           | Maximum matching error (dB) |           |  |  |
|---------------------------|-----------------------------|-----------|--|--|
| Type of filterbanks       | Example A                   | Example B |  |  |
| Uniform (direct design)   | 6.39                        | 5.86      |  |  |
| [14]                      | 9.61                        | 3.63      |  |  |
| [16]                      | 4.82                        | 2.67      |  |  |
| Proposed I (Quasi-class2) | 3.50                        | 1.78      |  |  |
| Proposed II (class2)      | 2.40                        | 1.53      |  |  |
| Proposed III (class0)     | 0.046                       | 0.023     |  |  |

表.3 各種濾波器組的 matching error 比較



**B** 10 Calibration flow of minimum matching error for audiogram

#### **III.** COMPARISON RESULTS

在現有文獻中,僅有我們針對頻帶不匹配群延遲提出校正技術,因此整體演算法能夠提供 較好的語音理解度。在其他項目的評比,我們所提出的 class2 規格的演算法(Proposed-II),亦使 用階數為 2 的多速率架構來實現,而群延遲為 17ms 是屬於哈斯效應所規範的延遲範圍(5ms – 25ms),除此之外,每個樣本信號所需乘法運算量為 176,每個樣本所需之加法運算量為 1254, 與 Kuo 等人[7]比較,雖然乘法、加法運算量皆較大,但能有效的改善群延遲的問題;與 Liu 等 人 [13]比較,雖然群延遲與加法運算量相較來的高些,但整體的乘法運算量可降低 25%且滿足 class2 的規格。若考量採以同樣標準來比較,我們亦提出近似 class2 之演算法(Proposed-I),其 分析結果顯示相較於 Liu 等人 [13]的設計方案,我們擁有極低的乘法運算量。另一方面,若嘗 試以不減少多率架構(如 Kuo 等人 [7])的階數來設計演算法,我們所提出的演算法(Proposed-III) 除了能滿足 class0 的規範,以及大量減少複雜度外,亦增加許多的群延遲。這意味著若我們在 未來必須投注心力於減少群延遲的影響,如此一來才能在運算複雜度與群延遲上的取捨上獲得 平衡。

| Method                | [7]        | [7]      | [12]         | [13]         | Proposed-I   | Proposed-II | Proposed-III  |
|-----------------------|------------|----------|--------------|--------------|--------------|-------------|---------------|
| Sampling rate         | 24 kHz     | 24 kHz   | 24 kHz       | 24 kHz       | 24 kHz       | 24 kHz      | 20.48 kHz     |
| Number of Bands       | 18         | 18       | 18           | 18           | 18           | 18          | 18            |
| Level                 | Class2     | Class2   | Quasi-class2 | Quasi-class2 | Quasi-class2 | Class2      | Class0        |
| Structure             | Multi-rate | Parallel | Parallel     | Multi-rate   | Multi-rate   | Multi-rate  | Radix2-FFT-FB |
| Multiplication/sample | 140        | 3270     | 570          | 235          | 72           | 176         | 145           |
| Addition/sample       | 278        | 6545     | unlisted     | 495          | 518          | 1254        | 409           |
| # of Coeff.           | 91         | 3270     | unlisted     | 506          | 106          | 165         | 1536          |
| Multi-rate Stage      | 6          | N/A      | N/A          | 3            | 2            | 2           | N/A           |
| Maximum Group Delay   | 78 ms      | 28 ms    | 9.8 ms       | 10 ms        | 12.56 ms     | 17 ms       | 100 ms        |
| Group Delay Mismatch  | Yes        | Yes      | Yes          | Yes          | No           | No          | No            |

表4 各種濾波器組之設計比較

### IV. 結語

本計畫近程將以演算法開發為主,同時搭配FPGA與客製化晶片IC實現。關於濾波器組的 演算法的部分實現後,提供相關模組給中山醫學大學聽語系的郭明雯教授與羅敦信老師,進行 成果效能的評斷與試驗,以利後續臨床實驗與未來合作。目前所研發出的演算法與硬體共計有 Quasi-ANSI S1.11 class-2、ANSI S1.11 class-2、以及ANSI S1.11 class-0,相信對於未來助聽輔具 之發展將能導引更先進的核心技術。

### ACKNOWLEDGMENTS

The PI would like to thank Prof. Sheau-Fang Lei for his previously help to give abundant manpower and some suggestions of this research topic.

## REFERENCES

- [1] "ANSI Standard S1.11-2004," Specification for Octave-Band and Fractional-Octave-Band Analog and Digital Filters.
- [2] D. Byrne, H. Dillon, T. Ching, R. Katsch, and G. Keidser, "NAL-NL1 procedure for fitting nonlinear hearing aids: characteristics and compositions with other procedures," Journal of the American Academy of Audiology, vol. 12, no. 1, pp. 37-54, Jan. 2001.
- [3] G. Keidser, H. Dillon, M. Flax, T. Ching, and S. Brewer, "The NAL-NL2 prescription procedure," Audiology Research, vol. 1 no. 1, pp. 88-90, 2011.
- [4] L. S. Nielsen and J. Sparsoe, "Designing asynchronous circuits for low power: An IFIR filter bank for a digital hearing aid," Proceedings of the IEEE, vol. 87, no. 2, pp. 268 -281, 1999.
- [5] H. Li, G. A. Jullien, V. S. Dimitrov, M. Ahmadi, and W. Miller, "A 2-digit multidimensional logarithmic number system filter bank for a digital hearing aid architecture," Proc. IEEE Int. Symp. Circuits and Systems, vol. 2, pp.760 -763, 2002.
- [6] K.-S. Chong, B.-H. Gwee and J.-S. Chang, "A 16-channel low-power nonuniform spaced filter bank core for digital hearing aid," IEEE Tran. Circuits Syst. I, Reg. Papers, vol. 53, no. 9, pp.853 -857, 2006.
- [7] Y.-T. Kuo, T.-J. Lin, Y.-T. Li, and C.-W. Liu, "Design and implementation of low-power ANSI S1.11 filter bank for digital hearing Aids," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 7, pp. 1684-1696, 2010.
- [8] M. W. Spitzer, A. D. Bala and T. T. Takahashi, "A Neuronal Correlate of the Precedence Effect Is Associated With Spatial Selectivity in the Barn Owl's Auditory Midbrain," Journal of Neurophysiology, vol. 92, no. 4, pp. 2051-2070, Oct. 2004.
- [9] H. Haas, "On the influence of a single echo on the intelligibility of speech," Acustica, vol. 1, pp. 48–58, 1951.
- [10] http://www.synaudcon.com/site/articles/its-about-time-the-effective-haas-effect/
- [11] J. Agnew, "An overview of digital signal processing in hearing instruments," The Hearing Review, July 1997.
- [12]C.-H. Lin, K.-C. Chang, M.-H. Chuang and C.-W. Liu, "Design and implementation of 18-band Quasi-ANSI S1.11 1/3 octave filter bank for digital hearing aids," VLSI Design, Automation, and Test (VLSI-DAT), vol. 23, no. 25, pp. 1-4, April 2012.
- [13]C.-W. Liu, K.-C. Chang, M.-H. Chuang and C.-H. Lin, "10-ms 18-Band Quasi-ANSI S1.11 1/3-Octave Filter Bank for Digital Hearing Aids," IEEE Tran. Circuits Syst. I, Reg. Papers, vol. 60, no. 3, pp. 638-649, 2013.

- [14] Y. Lian and Y. Wei, "A Computationally Efficient Non-Uniform FIR Digital Filter Bank for Hearing Aid," IEEE Tran. Circuits Syst. I, Reg. Papers, vol. 52, no. 12, pp. 2754-2762, Dec. 2005.
- [15] Y. Wei and Y. Lian, "A 16-band nonuniform FIR digital filterbank for hearing aid," Proc. IEEE Biomed. Circuits Syst. Conf., pp.186 -189, 2006.
- [16] Y. Wei and D. Liu, "A Reconfigurable Digital Filterbank for Hearing-Aid Systems with a Variety of Sound Wave Decomposition Plans," IEEE Trans. on Biomedical Engineering, vol. 60, no. 6, pp.1628-1635, June 2013.
- [17] A. Petrovsky, M. Parfieniuk, and K. Bielawski, "Psychoacoustically motivated nonuniform cosine modulated polyphase filter bank," in Proc. 2nd Int. Workshop on Spectral Methods and Multirate Signal Process. (SMMSP), Toulouse, France, 7–8 Sep. 2002, pp. 95–101.
- [18] P. Vary, "An adaptive filter-bank equalizer for speech enhancement," Signal Process., vol. 86, no.6, pp. 1206–1214, 2006.
- [19] Marek Parfieniuk, and Alexander Petrovsky, Near-Perfect Reconstruction Oversampled Nonuniform Cosine-Modulated Filter Banks Based on Frequency Warping and Subband Merging, International Journal of Electronics and Telecommunications, Vol. 58, No. 2, pp. 177-192, May 2012.
- [20] Shin-Chi Lai, Chih-Hao Liu, Ling-Yi Wang Shin-Hao Chen, and Ke-Horng Chen, "11.25-ms-Group-Delay and Low-Complexity Algorithm Design of 18-band Quasi-ANSI S1.11 1/3 Octave Digital Filter Bank for Hearing Aids," IEEE Transactions on Circuits and Systems-I: Regular Papers, vol. 62, no. 6, pp. 1572–1581, June 2015.
- [21] J. O. Smith, "Audio FFT filter banks," in Proc. 12th Int. Conf. Digital Audio Effects(DAFx-09), Como, Italy, Sep.2009.
- [22] "Versatile FFT Supports Accurate 1/3 Octave Analysis," Copyright © 2010, Microstar Laboratories, Inc., http://www.mstarlabs.com/docs/tn257.pdf
- [23] R. Crochiere, "A weighted overlap-add method of short-time Fourier analysis/synthesis," IEEE Transactions on Acoustics, Speech, Signal Processing, vol. ASSP-28, pp. 99–102, Feb. 1980.

# A LOW-COMPLEXITY RECURSIVE ALGORITHM AND COMPACT HARDWARE DESIGN FOR SLIDING DISCRETE FOURIER TRANSFORM

Shin-Chi Lai and Chia-Chun Tsai

Department of Computer Science and Information Engineering, Nanhua University

### Wen-Ho Juang, Yi-Hsiang Juan and Ching-Hsing Luo

Department of Electrical Engineering, National Cheng Kung University

#### ABSTRACT

This paper proposes a novel sliding discrete Fourier transform (DFT) algorithm and architecture design for efficient computation of time-frequency spectrum information. Since the sliding process is handled sample by sample, the spectral bin output data rate can be same as the input data rate. Under the conditions of M-sample real input sequence (M=256) and N-point recursive DFT computation (N=64), the advantages of the proposed method are summarized as follows: 1) Proposed-I requires less computational complexity than Krzysztof Duda's method. The number of multiplication operations of the proposed algorithm is 44,864 and achieves an 80.35% reduction. In addition, 54.91% of addition operations are reduced. 2) For computing each frequency bin, Proposed-I only requires 4addition and 2 multiplication operations after the first spectral component has been finally calculated; 3) Proposed-II utilizes three registers and a re-timing scheme to effectively shorten and balance the critical path of the proposed design. Moreover, the number of coefficients is reduced by 75% compared to Krzysztof Duda's method. For FPGA implementation, the proposed design can be operated at 43.5 MHz processing rate, which can easily meet the requirements of real-time applications. Therefore, this approach is suitable for real-time analysis of time-frequency spectrum in the future.

Key words: Discrete Fourier transform (DFT), Recursive DFT (RDFT), Sliding DFT (SDFT).

# I. INTRODUCTION

In many digital signal processing applications, Discrete Fourier Transform (DFT) is widely employed to analyze signal power in the frequency domain. To observe the variety of time-frequency spectrum in detail, a sliding sinusoidal transform [1], such as sliding DFT (SDFT) [2, 3, 4], is proposed to offer enough information in a short-time domain. Figure 1 demonstrates the sliding window for 8-point DFTs. The data in Fig. 1 (a) is sampled during a short period, and Fig. 1 (b) displays the next sliding window after Fig. 1 (a). Due to the

recursive relationship between two subsequent local transform spectra [2, 3], the recursive DFT can be combined with the sliding unit to implement a SDFT computation. Goerzel's well-known DFT [5] is useful and powerful in certain practical frequency bins for dual-tone multi-frequency signaling (DTMF) recognition applications. Unlike fast Fourier transform (FFT) [6, 7, 8] calculations, it uses less computational workload while fewer frequency bins are required. Recently, various recursive DFTs (RDFTs) [9-14] have been well developed. Lai *et al.* [10, 11, 12] achieved great improvement in terms of computation cycles and computational complexity. To achieve high performance computation, a hybrid architecture design of RDFT and radix-2<sup>r</sup> FFT

DOI:10.6329/CIEE.2016.2.05

Manuscript received Oct. 28, 2015; and revised Feb. 26, 2016; accepted Apr. 29, 2016.

This work was supported in part by the Ministry of Science and Technology, Taiwan under grant number 103-2221-E-343-001, 103-2221-E-343-005, and 104-2221-E-343-003. This work was also supported in part by NHU Research Project under grant number Y103000951 and Y103000958.

[14] was recently proposed for variable-transform-length DFT. Krzysztof Duda proposed a powerful mSDFT algorithm using a modulated concept [4] to obtain a guaranteed accurate and stable SDFT computation. However, it requires heavy computational workload per output sample, *i.e.* 10 real multiplications and 7 real additions more than previous works [2, 3]. Based on the experiences of the above works, a low-complexity and low-cost RDFT is further extended to develop a novel SDFT algorithm with a comparable accuracy to Jacobsen and Lyons' SDFT [2, 3]. Since SDFT is used in various signal processing applications, such as for ultrasonic range finders [15], ECG noise cancelers [16], and phase locking schemes [17], a real-time, low-cost and low-complexity SDFT algorithm has become an essential issue not only for software realization but also for hardware implementation.

In this work, we focus on not only reducing the algorithm computational complexity but also reducing the hardware implementation cost. The proposed algorithm is derived in detail and then is mapped to a hardware architecture. By using some basic very-large-scale integration (VLSI) schemes, a low-cost and high-speed SDFT hardware accelerator can be efficiently constructed.

The rest of this paper is organized as follows: Section II provides a detailed algorithm derivation and then the system transfer function of the proposed SDFT formula is given. Section III demonstrates the compact architecture design of the proposed SDFT, and Section IV compares various performance metrics of the proposed method with other approaches. Finally, conclusions are given in Section V.

# II. PROPOSED ALGORITHM DERIVATION OF NOVEL SLIDING DFT COMPUTATION

The N-point DFT computation with input (x[n]) and output (X[k]) sequences is defined as follows:



Fig. 1 Signal window for two 8-point DFTs: (a) data samples in the first computation and (b) second computation samples.

$$X[k] = \sum_{n=0}^{N-1} x[n] \times W_N^{nk}, \text{ where } W_N^{nk} = e^{j2\pi nk/N} \text{ and } k = 0 \text{ to } N-1.$$
(1)

The transfer function of the Goertzel DFT [5] is further derived as a z-transform formula (2) by a recursive difference equation (3). Notice the recurrent loops of Goertzel DFT occurs (N+1) times in total, and the input sequence should be set zero during the  $(N+1)^{th}$  time slot.

$$H_G(z) = \frac{1}{1 - W_N^k z^{-1}} = \frac{1 - W_N^{-k} z^{-1}}{1 - 2\cos(2\pi k / N) z^{-1} + z^{-2}}$$
(2)

$$y[n] = (W_N^k \times y[n-1] + x[n]), \text{ where } y[-1] = 0$$
(3)

By following up Jacobsen and Lyons' approach [2], the system transfer function of sliding Goertzel DFT [5] can be defined as (4). Comparing (2) with (4), the major difference is that the numerator of  $H_{SG}(z)$  is multiplied by (1- $z^{-N}$ ).

$$H_{SG}(z) = \frac{\left(1 - W_N^{-k}\right)\left(1 - z^{-N}\right)}{1 - 2\cos\left(2\pi k \,/\, N\right)z^{-1} + z^{-2}} \tag{4}$$

A previous work [10] had a different difference equation (5), so the transfer function of Lai *et al.*'s DFT was derived as (6). Eq. (6) is expressed as (7), where  $\theta_k$  was set to  $2\pi k/N$ . To reduce the multiplication of  $\cos(\theta_k)$ , the coefficients of  $\cos(\theta_k)$  and  $2\cos(\theta_k)$  were shared by using one real multiplier and a shifter. Compared to Goertzel's DFT, the total recurrent times of Lai *et al.*'s RDFT only takes N iterations due to the derivation of (5).

$$m[n] = W_N^k \times (m[n-1] + x[n]), \text{ where } m[-1] = 0$$
(5)

$$H_{Lai}(z) = \frac{W_N^k - z^{-1}}{1 - 2\cos(2\pi k / N)z^{-1} + z^{-2}}$$
(6)

$$H_{Lai}(z) = \frac{\cos(\theta_k) + j\sin(\theta_k) - z^{-1}}{1 - 2\cos\theta(\theta_k) z^{-1} + z^{-2}} = \frac{j\sin(\theta_k) + (\cos(\theta_k) - z^{-1})}{1 - z^{-1} (2\cos(\theta_k) - z^{-1})}$$
(7)

By multiplying the factor  $(1-z^{-N})$  into  $H_{Lai}(z)$ , the system transfer function of the proposed SDFT, *i.e.*  $H_{SLai}(z)$  yields

$$H_{SLai}(z) = \frac{\left[j\sin(\theta_{k}) + \left(\cos(\theta_{k}) - z^{-1}\right)\right] \times \left(1 - z^{-N}\right)}{1 - z^{-1} \left(2\cos(\theta_{k}) - z^{-1}\right)}.$$
 (8)

Eq. (8) shows that only two real multiplications are required for the coefficients of  $sin(\theta_k)$  and  $cos(\theta_k)$  in the S. C. Lai, W. H. Juang, Y. H. Juan, C. H. Luo and C. C. Tsai: A Low-Complexity Recursive Algorithm and Compact Hardware 75 Design for Sliding Discrete Fourier Transform

proposed SDFT. Compared with Krzysztof Duda's mSDFT in [4, Fig. 4], the proposed method has less computational complexity. From the view of algorithm derivation, the second order Goertzel's architecture can be derived from one-pole SDFT, and then Goertzel's DFT takes less computational complexity than one-pole DFT. Considering the real-valued inputs, only N+2 multiplications are required for Goertzel's algorithm. However, 3N multiplications are required for one-pole DFT because of the input data being multiplied by a complex coefficient. Here, we assumed that one complex multiplication only require 3 real multiplications. The proposed SDFT based on previous approach [10] has the advantages of lower complexity, shorter critical path, and the fewest hardware costs compared with Goertzel SDFT and one-pole SDFT. In addition, a hardware-sharing scheme can be employed to reduce one register and one multiplier, while the coefficient of the term  $Z^{-1}$  of denominator in Eq.(8) is relatively similar to the computation of numerator in Eq.(8) as Fig. 2 shown.

Since this compact structure of recursive DFT was designed with the advantages of short critical path and low computational complexity in previous work [10], it was easily found that a novel sliding DFT algorithm could be directly derived from (5)-(8) in this work. Different to [2, 3, 4], the twiddle factor, *i.e.* the complex multiplication, is divided into a cosine term and a sine function in this work. Eq.(8) and Fig. 2 shows that the cosine term of denominator can be shared to that of numerator not only in computation but also in realization. This is the rationale of the proposed algorithm greatly reducing the computational complexity compared with other related SDFT algorithms [2-4]. It should be noticed that the sliding window is processed sample-by-sample so that the next desired single bin calculated by using the DFT algorithm of previous work [10] would take N real multiplications. However, it only requires 2 real multiplications in this work. For the stability issue, the proposed method is similar to Goertzel's algorithm which has two poles as shown in (4). In general, a damping factor r, which be applied to be a radius of r inside the unit circle, can guarantee the stability. The details have been well discussed in [2, 3].

# III. ARCHITECTURE DESIGN OF THE PROPOSED NOVEL SDFT

#### A. Design Concept of Kernel Processing Unit

According to (8), the architecture design of the proposed sliding recursive DFT algorithm can be easily mapped into Fig. 2. It can be observed that the critical path of the proposed algorithm contains  $(T_M + 4T_A)$ , where  $T_M$  and  $T_A$  are the time periods of the multiplier and adder, respectively. After retiming and adding three registers, Fig. 2 becomes a higher speed SDFT as given in Fig. 3. The coefficient requirement in this work as



Fig. 2 Proposed Novel Sliding DFT Algorithm (Proposed I).



Fig. 3 Proposed Low-critical-path SDFT Hardware Design (Proposed II).

well as previous work [10] can be easily reduced by 50% due to the symmetric identities of sine and cosine. Since the input sequence is a real number, the computational complexity of the proposed SDFT algorithm takes 2 real multiplications and 4 real additions in total for the calculation of each frequency bin after the first spectral component is calculated.

#### **B.** Overall Hardware Architecture Design

The overall hardware architecture is divided into the following three main blocks: (1) Control Unit (CU); (2) Memory Unit (MU); (3) Processing Unit (PU). Figure 4 shows the overall architecture design where the MU is composed of RAM and ROM. The ROM and RAM are applied for storing the twiddle factors and implementing the element of Z<sup>-N</sup>, respectively, as mentioned in Fig. 3. The input sequence x[n] is fed to the PU for recursive computation. In general, the proposed hardware accelerator would be integrated with a greater system so it has a handshaking process using signals, *i.e.* xValid and XValid, to notify the input/output data is ready. When the CU receives a signal, it will follow the finite-statemachine rule to determine the action mode ('READ' or 'WRITE') of the MU. From the Initial State, MU only executes the action of 'WRITE' until the next state, i.e. Pre-Read and Write State, is ready for entry. The CU further controls the relative memories to finish the corresponding action modes according to the fed data form MU. When the CU senses that there is no input data, *i.e.* xValid is 'LOW', the coefficient ROM can process the request of coefficient update. Concurrently, the CU determines if the addressing of coefficient ROM is increased or decreased. Finally, the sliding DFT operation is finished. The timing diagram of the overall architecture is displayed in Fig. 5.



Fig. 4 Proposed Overall Hardware Architecture Design.



Fig. 5 Timing Diagram of the Proposed Architecture Design.

#### C. Reduction Method for Coefficient-ROM Usage

According to (8), 2N word ROM size is required to store the coefficients for realization of an N-point SDFT computation. By adopting the symmetric identity of sinusoid function, the number of coefficient around the region of  $2\pi$ , as shown in Fig. 6(a), can be directly mapped into  $\pi/4$ . This means that 2N coefficients would be reduced by 75%. The cosine and sine coefficients only change with the index of k for the shifting time samples and would be pre-loaded into registers for further use. Figure 6(b) shows the proposed method to access the desired coefficient. For cosine function, the solid red line starts from k=0 to k=N/4, and then turns back to k=0 according to the dotted red line. Similarly, the solid blue line begins from k=N/4 to k=0, and then would turn back to k=N/4 according to the dotted blue line. This method can be further combined with the up/down counter to implement the control rule of memory access. After the counter generates the memory address, most significant bit can be applied to determine if the operation of complement of the coefficient is needed. Then, the proper coefficient would be provided to the multiplier to calculate the exact result.

#### **D.** The Control of Memory Unit

To efficiently reduce area cost, the memory controller for the access of RAM and ROM is designed into two parts. The first one is RAM controller, and the other is for ROM.



Fig. 6 Relationships between Coefficients of Sine and Cosine.

#### (1) RAM controller design:

SDFT computation requires an accumulated circuit in the first stage to recursively calculate the operation of (1-Z<sup>-N</sup>). This implies that the input data should be carefully stored via the delay element of  $(Z^{-N})$ . To realize this delay element and to solve the synchronizing access of 'READ' and 'WRITE' requests, a direct mapping method instead of FIFO registers is employed via a two-port memory. However, this still costs more area and hardware resources for implementation. Therefore, we adopt two half-size and single-port memories (memory #1 and #2) with a small control circuit to achieve a two-port memory in this work. The behavior of control circuit has three states, i.e. "initial" state, the state of "read action for memory#1 and write action for memory#2," and the state of "write action for memory#1 and read action for memory#2." The timing diagram is shown in Fig. 7. In the initial state, the action modes of these two memories are alternately writing data into memory until (N-1) clock cycles finished. Then, memory#1 continues writing and memory#2 changes the action mode into 'READ' for data pre-loading which is used for the time period of the (N+1)<sup>th</sup> clock cycle. This state is kept executing until the input of SDFT does not have any more data. An informal high-level pseudo code as shown in Fig. 8 is carefully described the address generation rule for memory accesses.

#### (2) ROM controller design:

For N-point sliding window, the cosine and sine functions take 2N coefficients. In the hardware implementation, we can employ the up/down counter and the symmetric identity of coefficients as mentioned above to reduce the number of coefficients by 75%. Figure 9 shows the timing diagram of the cosine coefficient.



Fig. 7 Timing Diagram of RAM Control Rule.

S. C. Lai, W. H. Juang, Y. H. Juan, C. H. Luo and C. C. Tsai: A Low-Complexity Recursive Algorithm and Compact Hardware 77 Design for Sliding Discrete Fourier Transform

| #01 | <b>Initial</b> Address counter = $0$ :                    |
|-----|-----------------------------------------------------------|
| #02 | if (Address_counter $\leq N$ )                            |
| #02 | if (Address_counter is odd)                               |
| #03 | Odd memory state = "Writing"                              |
| #04 | Odd_memory_state = "writing";                             |
| #05 | Even_memory_state = "Idle";                               |
| #06 | Odd_memory_address = $(Address_counter \% N) / 2;$        |
| #07 | Even_memory_address = $(Odd_memory_address + 1) % (N/2);$ |
| #08 | else                                                      |
| #09 | Odd_memory_state = "Idle";                                |
| #10 | Even_memory_state = "Writing";                            |
| #11 | Odd_memory_address = $(Address_counter \% N) / 2;$        |
| #12 | Even_memory_address = Odd_memory_address;                 |
| #13 | end                                                       |
| #14 | else                                                      |
| #15 | if (Address_counter is odd)                               |
| #16 | Odd_memory_state = "Writing";                             |
| #17 | Even_memory_state = " <b>Reading</b> ";                   |
| #18 | Odd_memory_address = $(Address_counter \% N) / 2;$        |
| #19 | Even_memory_address = $(Odd_memory_address + 1) % (N/2);$ |
| #20 | else                                                      |
| #21 | Odd_memory_state = "Reading";                             |
| #22 | Even_memory_state ="Writing";                             |
| #23 | Odd_memory_address = $(Address_counter \frac{\% N}{2};$   |
| #24 | Even_memory_address = Odd_memory_address;                 |
| #25 | end                                                       |
| #26 | end                                                       |

Fig. 8 Pseudo Code for RAM Address Mapping.



Fig. 9 Timing Diagram of ROM Control Rule.

While cosine coefficient is under the action mode of 'READ' from memory, an up counter can be used to produce the required memory address while the index of k is increased with frequency bin changing. Then, the desired value of the cosine coefficient,  $\cos(2*pi*(k+1)/N)$ , for the next time period is prepared after finishing all operations during the time period of the k index. Due to the symmetric identity of the cosine function,  $\cos(2*pi*(k+1)/N)$  can be instead of  $\cos(2*pi*(k+1)/N)$  so that the down counter can be applied to generate the desired memory address. Based on this scheme, the total size of coefficient ROM is N/2 words.

# IV. COMPARISON AND DISCUSSION

In this section, the performance metrics, in terms of multiplication, addition, and coefficients, are evaluated in Table 1 for SDFT and FFT-based SDFT algorithms. Here, we assume that the transform length (N) of DFT is 64, the number of time-domain total sliding samples (M) is 256, and the input signals are real numbers. Notice that one complex multiplication requires four real multiplications and two additions for the following perfor-

mance evaluation. The results shown in Table 1 (a) illustrate that radix-2-based FFT [8] has a total of 244,992 multiplications and 244,992 additions. However, the sliding RDFT algorithms such as Jacobsen and Lyons [2] and the proposed algorithm have much lower computational complexity than FFT-based SDFT computation. Krzysztof Duda's algorithm [4], uses 228,352 multiplications and 207,936 additions for calculating SDFT coefficients. Compared with the latest SDFT approach, the numbers of multiplication and addition for the proposed algorithm are, respectively, 44,864 and 93,760 which have 80.35% and 54.91% reduction, respectively. Table 1 (a) shows the numbers of coefficients and computational cycles of the proposed SDFT algorithm are comparable to that of [2, 4, 8]. For analyzing the execution time of the hardware accelerator, the critical path and computational cycles are both involved. This is the reason the proposed design has similar performance while having with less computational operations. In addition, the number of coefficients for the proposed algorithm is reduced by 75% compared with previous works [2, 4]. Considering the input data might be a complex number, Table 1 (b) shows the computational complexity in detail for the proposed SDFT calculation. The proposed algorithm has the least numbers of multiplication and addition operations compared to than previous approaches [2, 4, 8] because of a compact and simplified recursive kernel design. Compared to [4], the proposed performs 89,728 multiplications and 228,352 additions with the reduction of 67.64% and 25.31%, respectively. If the sliding window (M) is greater than 256, the difference would be even more statistically significant.

Considering the accuracy issue, the second order polynomial in the denominator with 2 complex conjugate poles will make the system become unstable, while these poles are located at the outside of unit circle. Recently, the latest approach [18] clearly discusses and corrects the misconception of Goertzel filter. A detailed proof of Goertzel filter guaranteed stability is also given by Richard Lyons, the co-author of sliding DFT issues [2, 3]. Since the proposed sliding DFT algorithm has the same denominator, the result of guaranteed stability of Goertzel algorithm can be applied in this work. Here, we provide the MSE and SQNR comparisons among the proposed algorithm, standard one-pole SDFT, and Goertzel's SDFT as shown in Fig. 10. By the way, Krzysztof Duda's algorithm [4] is excluded in this comparison because their modulated SDFT (mSDFT) is more accurate than other algorithms as shown in [4, Fig. 5]. But, it requires heavy computational workload per output sample, i.e. 10 real multiplications and 7 real additions more than previous works [2, 3] and this work. The fixed-point simulation results given in Fig. 10 were used to demonstrate how the MSE and SQNR of the computed coefficients vary with the word lengths. Some parameters are setting as follows: (1) Input word length is 16 bits; (2) Coefficient word length is from 16-bit to 24-bit; (3) 10<sup>5</sup> zero-mean random input testing patterns are given; (4) DFT Transform length of N is

| Metrics        | 2003 [2]         | 2010 [4]            | 2014 [8]                    | Proposed I & II      |
|----------------|------------------|---------------------|-----------------------------|----------------------|
| Multiplication | N(3M+4N-3)       | N(10M+16N-16)       | (M+N-1)2Nlog <sub>2</sub> N | N(2M+3N-4)           |
| Addition       | N(4M+7N-4)       | N(9M+15N-15)        | (M+N-1)2Nlog <sub>2</sub> N | N(4M+7N-7)           |
| Coefficient    | 2N               | 2N                  | Ν                           | Ν                    |
| Cpt. Cycle (a) | N(N+1)+M-1       | N <sup>2</sup> +M-1 | Unlisted                    | N2+M-1               |
| Cri Dath (b)   |                  |                     | Imlisted                    | $4T_{A}+T_{M}^{(I)}$ |
| Cri. Paul (b)  | $2I_{A} + I_{M}$ | $2I_A \pm 2I_M$     | Unifisted                   | $2T_A + T_M^{(II)}$  |
| Cpt. Time      | (a)*(b)          | (a)*(b)             | Unlisted                    | (a)*(b)              |

Table 1 (a) Computational Complexity Comparison for Various N-point DFT and FFT Algorithm Applied for Real Input Sliding Application.

Note: Cri. Path = Critical Path; Cpt. Cycle = Computational Cycle.

 

 Table 1 (b)
 Computational Complexity Comparison for Various N-point DFT and FFT Algorithm Applied for Complex Input Sliding Application.

| Metrics        | 2003 [2]                 | 2010 [4]      | 2014 [8]                    | Proposed I & II |
|----------------|--------------------------|---------------|-----------------------------|-----------------|
| Multiplication | N(6M+8N-8) N(12M+20N-20) |               | (M+N-1)2Nlog <sub>2</sub> N | N(4M+6N-6)      |
| Multiplication | 130,560                  | 277,248       | 244,992                     | 89,728          |
| Addition       | N(11M+17N-17)            | N(13M+23N-23) | (M+N-1)2Nlog <sub>2</sub> N | N(10M+16N-16)   |
| Addition       | 248,768                  | 305,728       | 244,992                     | 228,352         |



Fig. 10 Error Analysis of Various SDFT Algorithms with Different Word Length (a) Mean-Square Error; (b) Signal-to-Quantization-Noise Ratio.

setting to 64, iteration loops are chosen to 256, and the desired bin of k is setting to 3. The results show that the proposed SDFT has similar performance than Goertzel's SDFT.

Table 2 compares hardware cost in terms of multiplier-, adder-, multiplexer-, register-, and coefficient-ROM usage. The results demonstrate that the proposed design has the lowest cost compared other approaches. There are only 2 real multipliers, 5 real adders, 69 registers, and 64 coefficients in the implementation of the proposed method. Compared with Krzysztof Duda's [4], the numbers of multiplier and coefficients are, reduced by 80% and 50%, respectively. On the other hand, although previous work [10] shows the least hardware cost, it is still not a suitable solution for SDFT computation. Since it was designed for DFT computation, the computational complexity of single bin would always take (N+1) real multiplications. While the sliding window is shifted sample by sample, the proposed SDFT algorithm only requires 2 real multiplications / per sample. For hardware realization, Table 3 shows the detailed FPGA implementation results. They indicate that the proposed design, which is implemented by Altera Cyclone IV device, can operate at 43.5 MHz. This implies that the

S. C. Lai, W. H. Juang, Y. H. Juan, C. H. Luo and C. C. Tsai: A Low-Complexity Recursive Algorithm and Compact Hardware 79 Design for Sliding Discrete Fourier Transform

Table 2 Hardware Cost Comparison for Various 64-point DFT and FFT Architecture Designs Applied for Real Input Sliding Application.

| Metrics     | 2003 [2] | 2009 [10] | 2010 [4] | 2014 [8] | Proposed I | Proposed II |
|-------------|----------|-----------|----------|----------|------------|-------------|
| Structure   | SDFT     | DFT       | SDFT     | FFT      | SDFT       | SDFT        |
| Multiplier  | 5        | 1         | 10       | 768      | 2          | 2           |
| Adder       | 4        | 8         | 7        | Unlisted | 5          | 5           |
| Register    | 64+2     | 6         | 64+4     | Unlisted | 64+2       | 64+5        |
| Coefficient | 128      | 62        | 128      | 64       | 32         | 32          |

| Table 3 I | FPGA Imp | lementation | Results. |
|-----------|----------|-------------|----------|
|-----------|----------|-------------|----------|

| Hardware  | Cost  | Hardware Block<br>(Implemented by Altera Cyclone IV E EP4CE115F29C7) |           |                  |             |       |  |  |
|-----------|-------|----------------------------------------------------------------------|-----------|------------------|-------------|-------|--|--|
|           |       | Controller Unit                                                      | RDFT Unit | Accumulated Unit | Memory Unit | Total |  |  |
| Combina   | ition | 65                                                                   | 251       | 16               | 2           | 334   |  |  |
| Regist    | er    | 46                                                                   | 128       | 16               | 1           | 191   |  |  |
| DSP #1    |       | 0                                                                    | 16        | 0                | 0           | 16    |  |  |
| DSP #     | \$2   | 0                                                                    | 0 8 0 0   |                  | 0           | 8     |  |  |
| Mamami    | Ram   | 64*16 bits                                                           |           |                  |             | 1024  |  |  |
| wielliory | Rom   | 32*24 bits                                                           |           |                  |             | 768   |  |  |

Note: #1: 9-bits Element Simple Multipliers (SMs); #2: 9×9 / 18×18 SMs

| Combinational area     | 174,037    |  |  |
|------------------------|------------|--|--|
| But / Inv area         | 10,209     |  |  |
| Noncombinational area  | 30,267     |  |  |
| Macro / Black Box area | 534,177    |  |  |
| Net Interconnect area  | 1,024,338  |  |  |
| Total cell area        | 738,481    |  |  |
| Total area             | 1,762,819  |  |  |
| Cell internal power    | 2.5775 mW  |  |  |
| Net Switching power    | 2.7156 mW  |  |  |
| Total dynamic power    | 5.2931 mW  |  |  |
| Cell leakage power     | 16.5311 uW |  |  |

| Table 4 | Synthesis Report of Chip Area and Power by Design |
|---------|---------------------------------------------------|
|         | Complier                                          |

proposed design can produce each frequency bins within the time period of 0.0229  $\mu$ s. It takes a total of 1.47  $\mu$ s for all frequency bins. Under the condition of 1 sound channel and 48 kHz sampling rate, the real-time specific requirement is 20.833  $\mu$ s which is more than that of the proposed 64-point SDFT hardware accelerator design. For chip design, the synthesis report of area and power simulated by Synopsys' Design Complier is shown in Table 4. The total area of the proposed SDFT chip is only 738481, and the total dynamic power of the proposed SDFT chip consumes 5.2931 mW. Overall, the proposed design would be more suitable for real-time application of time-frequency spectrum due to its low complexity.

### V. CONCLUSION

In this paper, a low-cost and low-complexity sliding DFT algorithm is developed and designed for time-frequency spectra calculation. The comparison results proved that the proposed method has better performance in terms of computational complexity, hardware resource usage and computational speed than other related approaches. It is more powerful method for future applications in digital signal processing. Finally, it is noticed that the hardware cost of the proposed design totally takes 4 multipliers, 10 adders, and 138 registers in SDFT implementation, while complex-valued inputs (*i.e.* real-part and imaginary-part inputs) are used.

### ACKNOWLEDGMENTS

This work was supported in part by the Ministry of Science and Technology, Taiwan under grant number 103-2221-E-343-001, 103-2221-E-343-005, and 104-2221-E-343-003. This work was also supported in part by NHU Research Project under grant number Y103000951 and Y103000958. The authors would like to thank Prof. Sheau-Fang Lei's help for this research topic.

# REFERENCES

- Vitaly Kober, "A Fast Algorithms for the Computation of Sliding Discrete Sinusoidal Transforms," IEEE Transactions on Signal Processing, vol. 52, no.6, pp. 1704-1710, June 2004.
- [2] E. Jacobsen and R. Lyons, "The sliding DFT,"IEEE Signal Processing Mag., vol. 20, no. 2, pp.74-80, Mar. 2003.
- [3] E. Jacobsen and R. Lyons, "An update to the sliding DFT," IEEE Signal Processing Mag., vol. 21, no.1, pp. 110-111, Jan. 2004.
- [4] Krzysztof Duda, "Accurate, Guaranteed Stable, Sliding Discrete Fourier Transform," IEEE Signal Processing Mag., vol. 27, no.6, pp. 124-127, Nov. 2010.
- [5] G. Goertzel, "An algorithm for the evaluation of finite trigonometric series," Amer. Math. Mon., vol. 65, pp. 34-35, Jan. 1958.
- [6] Koushik Maharatna, Eckhard Grass, and Ulrich Jagdhold, "A 64-Point fourier transform chip for high-speed wireless LAN application using OFDM," IEEE Journal of Solid-State Circuits, vol. 39, no. 3, pp. 484-493, Mar. 2004.
- [7] C. Yu, M.H.Yen, P.A. Hsiung, et al, "A Low-Power 64-point Pipeline FFT/IFFT Processor for OFDM Applications," IEEE Trans. Consum. Electron., vol. 57, no.1, pp.40-45, 2011.
- [8] V. Arunachalam, and Alex Noel Joseph Raj, "Efficient VLSI implementation of FFT for orthogonal frequency division multiplexing applications," IET Circuits Devices Syst., vol. 8, no. 6, pp. 526-531, Jun. 2014.
- [9] L. D. Van, C. T. Lin, and Y. C. Yu, "VLSI Architecture for the Low-computation Cycle and Power-efficient Recursive DFT/IDFT Design," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E90-A, no. 8, pp. 1644-1652, Aug. 2007.
- [10] S. C. Lai, S. F. Lei, C. L. Chang, C. C. Lin and C. H. Luo, "Low Computational Complexity, low Power, and Low Area Design for the Implementation of Recursive DFT and IDFT Algorithms," IEEE Trans. Circuits Sys. II, vol. 56, no. 12, pp. 921-925, Dec. 2009.
- [11] S. C. Lai, W. H. Juang, C. C. Lin, C. H. Luo, and S. F. Lei, "High-Throughput, Power-Efficient, Coefficient-Free and Reconfigurable Green Design for Recursive DFT in a Portable DRM Receiver," International Journal of Electrical Engineering, vol. 18, no.3, pp. 137-145, June 2011.
- [12] S. C. Lai, Y. S. Lee, and S. F. Lei, "Low-Power and Optimized VLSI Implementation of Compact RDFT Processor for the Computations of DFT and IMDCT in a DRM and DRM+ Receiver," J. Low Power Electron. Appl., vol. 3, no. 2, pp. 99-113,

May 2013.

- [13] S. C. Lai, W.-H. Juang, Y. S. Lee, and S. F. Lei, "High-Performance RDFT Design for Applications of Digital Radio Mondiale," IEEE International Symposium on Circuits and Systems (ISCAS), pp.2601-2604, Beijing China, May 19-23, 2013.
- [14]S. C. Lai, W. H. Juang, Y. S. Lee, S. H. Chen, K. H. Chen, C. C. Tsai, and C. H. Lee, "Hybrid Architecture Design for Calculating Variable-Length Fourier Transform," IEEE Transactions on Circuits and Systems-II: Express Briefs, in press, Sept. 20 2015.
- [15] P. Sumathi, and P.A. Janakiraman, "SDFT-Based Ultrasonic Range Finder Using AM Continuous Wave and Online Parameter Estimation," IEEE Transactions on Instrumentation and Measurement, vol. 59, no. 8, pp. 1994-2004, August 2010.
- [16] M.Z.U. Rahman, R.A. Shaik, and D.V.R.K. Reddy, "Efficient and Simplified Adaptive Noise Cancelers for ECG Sensor Based Remote Health Monitoring," IEEE Sensors Journal, vol. 12, no. 3, pp. 566-573, March 2012.
- [17] S. Mishra, D. Das, R. Kumar, and P. Sumathi, "A Power-Line Interference Canceler Based on Sliding DFT Phase Locking Scheme for ECG Signals," IEEE Transactions on Instrumentation and Measurement, vol. 60, no. 1, pp. 132-142, January 2015.
- [18] Richard Lyons, "Correcting an Important Goertzel Filter Misconception," the website of dsprelated.com, July 2015. http://www.dsprelated.com/ showarticle/796.php



**Shin-Chi Lai** was born in Taichung, Taiwan, R.O.C., in 1980. He received a B.S. degree in Electronic Engineering from ChienKuo Technology University, Changhua, Taiwan, in 2002. Then, he received his M.S. degree in Electronic Engineer-

ing from National Yunlin University of Science and Technology, Yunlin, Taiwan, in 2005. Recently, he received the Ph.D. degree at National Cheng Kung University, Tainan, Taiwan, in 2011. From Oct. 2011 to Jul. 2013, he had been an Assistant Research Fellow of the Department of Electrical Engineering at National Cheng Kung University, Tainan, Taiwan. Now, he has been an assistant professor of the Department of Computer Science and Information Engineering at Nan Hua University, Chiayi, Taiwan. His main



research interests are in signal processing and its circuit design, especially for speech, audio, biomedical, and multimedia applications.

**Wen-Ho Juang** was born in Taiwan. He received his M.S. degree in the Department of Electrical Engineering, S. C. Lai, W. H. Juang, Y. H. Juan, C. H. Luo and C. C. Tsai: A Low-Complexity Recursive Algorithm and Compact Hardware 81 Design for Sliding Discrete Fourier Transform

National Cheng Kung University, Tainan, Taiwan, in 2010. From Oct. 2011 to Sep. 2014, he had been an engineer in the Data Center Networking BU, MediaTek Inc., Hsinchu, Taiwan. He is currently pursuing the Ph. D. degree in the Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan; His research interests include digital signal processing and VLSI system design.



**Yi-Hsiang Juan** was born in Taiwan. He received the M.S. degree in electrical engineering from National Taipei University, New Taipei City, Taiwan, in 2011. Currently, he is working toward the Ph.D. degree in electrical engineering from National Cheng Kung University (NCKU),

Taiwan. His research interests include analog dedesign, analog-to-digital converter design, and Mixed-signal design.



**Ching-Hsing Luo** received a B.S. degree in electrophysics from National Chaio Tung University, Hsinchu, Taiwan, R.O.C., an M.S. degree in electrical engineering from National Taiwan University, in 1982, an M.S. degree in biomedical engineering from Johns

Hopkins University, Baltimore, MD, in 1987, and a Ph.D. degree in biomedical engineering from Case Western Reserve University, Cleveland, OH, in 1991. He has been a Full Professor in the Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C., since 1996 and was honored as a Distinguished Professor in 2005. His research interests include biomedical instrumentation-on-a-chip, assistive tool implementation, cell modeling, signal processing, home automata, RFIC, gene chips, and Pulse Diagnosis Platform in Traditional Chinese Medicine.



**Chia-Chun Tsai** received his B.S. degree in Industrial Education from National Taiwan Normal University, Taipei, Taiwan, ROC, in 1978 and both M.S. and Ph.D. degrees in Electrical Engineering from National Taiwan University, Taipei,

Taiwan, ROC, in 1987 and 1991, respectively. From 1978 to 1989, he was a specialist teacher of electronic maintenance at Taiwan Provincial Hsin-Hua and Taipei Municipal Nan-Kang Technology High Schools, respectively. From 1989 to 2005, he was an Instructor, an Associate Professor, and then a Full Professor at the Department of Electronic Engineering, National Taipei University of Technology, Taipei, Taiwan, ROC. From 1994 to 1995 and 2000 to 2001, he was working on postdoctoral research at University of California, San Diego, and North Carolina State University, respectively, in USA. Since 2005, he has been with the Department of Computer Science and Information Engineering, Nanhua University, Chiayi, Taiwan, ROC, where he is currently a Full Professor. His current research interests include VLSI design automation and mixed-signal IC design.

# Low-complexity hopping DFT design based on a compact recursive structure

W.-H. Juang, S.-C. Lai<sup>™</sup>, K.-H. Chen, W.-K. Tsai and C.-H. Luo

A novel hopping discrete Fourier transform (DFT) algorithm and its architecture design for efficiently computing time-frequency spectra are presented. Since the sliding process is adopted sample by sample, the spectral bin output data rate is the same as the input data rate. Under the conditions of an M-sample complex input sequence (M=256), and N-point recursive DFT computation (N=64) for time hop L (L=4), the proposed method has the following advantages: (i) the computational complexity of Proposed-I requires only four complex additions and four complex multiplications for each frequency bin, after the first spectral component has been finally calculated; (ii) Proposed-II utilises a re-timing scheme to greatly shorten and balance the critical path; (iii) Proposed-II is less computationally complex than Wang et al.'s method, as the numbers of multiplication and addition operations in the proposed algorithm are 768 and 1024, representing reductions of 50 and 20%, respectively. In addition, the number of coefficients can be reduced by 50% compared with Park et al.'s method. In the FPGA implementation, the proposed design can be operated at 47.26 MHz. It is thus more suitable for use with real-time analytic applications of time-frequency spectra.

Introduction: In many applications of digital signal processing, the discrete Fourier transform (DFT) method plays a key role and has been widely employed to analyse signal power in the frequency domain. Several high-efficiency mathematical algorithms have been developed for fast DFT computation [1, 2]. Recently, the sliding DFT (SDFT) [3-6] process has applied on a sample-by-sample basis. If the recursive relationship between two subsequent local transform spectra [4, 5] can be found, the recursive DFT can be combined with the sliding unit to implement an SDFT computation. However, these algorithms have to perform the DFT for all of the time indices, even though the DFT output only needs to be calculated every L (L > 0) samples. The hopping DFT (HDFT) has thus been proposed to reduce the computational work-load [7, 8]. The HDFT algorithm computes the DFT by incorporating the successive DFT outputs with arbitrary-time hop L. However, an increase in the number of time hops implies that the calculation complexity and hardware costs also are increased. This is the major reason why a low-complexity and low-cost novel HDFT algorithm is developed in the current work.



**Fig. 1** Proposed novel HDFT algorithm (Proposed-I) a Overall architecture of rHDFT b L-point UVT

In this Letter, a new DFT algorithm, called recursive HDFT (rHDFT), is proposed to solve the above problems. The rHDFT not only reduces the complexity of the algorithm, but also reduces the hardware cost in implementations. The algorithm is derived in detail in this work, and is also mapped into a hardware accelerator. By some basic very-large-scale integration (VLSI) schemes, a low-cost and high-speed rHDFT hardware accelerator can be efficiently accomplished.

Algorithm derivation of the proposed HDFT: The N-point DFT computation with the input (x[n]) and output (X[k]) sequences is defined as follows:

$$X_m[k] = \sum_{n=0}^{N-1} x \left[ \widehat{m} + n \right] \times W_N^{-nk}, \tag{1}$$

where  $\widehat{m} = m - N + 1$ ,  $W_N = e^{j2\pi/N}$  and k = 0 to N - 1.

The transfer function of SDFT [4, 5] is further derived as a *z*-transform formula (2). A better expression for the SDFT is described in (3)

$$H_{\rm SDFT} = \frac{\left(1 - z^{-N}\right) W_N^k}{1 - W_N^k z^{-1}}.$$
 (2)

$$X_m[k] = W_N^k \times (X_{m-1}[k] + x[m] - x[m-N]).$$
(3)

For arbitrary-time hops, the HDFT relationship between  $X_m[k]$  and  $X_m$  $_{-L}[k]$  in [7] is derived by recursively substituting  $X_m[k]$  into  $X_{m-L}[k] L$  times in (3). The resulting formula is given by

$$X_{m}[k] = W_{N}^{Lk} \times (X_{m-L}[k] + D_{m}^{L}[k]).$$
(4)

The  $D_m^L[k]$  is defined as the *k*th bin of the *L*-point updating vector transform (UVT) and is expressed as (5)

$$D_m^L[k] = \sum_{n=0}^{L-1} d[m-n] W_N^{(n-L+1)k},$$
(5)

where d[m] = x[m] - x[m - N] and k = 0 to N - 1.

Equation (5) can be rewritten as (6) and the *z*-domain transfer function for the *k*th bin of the single-bin  $\tilde{D}_m^L[k]$  filter is derived into (7) by a recursive difference (8)

$$D_{m}^{L}[k] = W_{N}^{k} \times \left(\tilde{D}_{m}^{L}[k] - d[m - L]\right),$$
  
where  $\tilde{D}_{m}^{L}[k] = \sum_{n=0}^{L} \left(d[m - n]W_{N}^{-Lk}\right)W_{N}^{nk}.$  (6)

$$H_{\tilde{D}}(z) = \frac{1}{1 - W_N^k z^{-1}} \tag{7}$$

$$m[n] = m[n-1]W_N^k + \widehat{d}[n],$$
where  $m[-1] = 0$  and  $\widehat{d}[n] = d[n]W_N^{-Lk}.$ 
(8)



**Fig. 2** *Proposed low-critical-path rHDFT design (Proposed-II)* a Hardware architecture design of proposed rHDFT b Hardware architecture design of proposed UVT

Architecture design of the proposed HDFT: Fig. 1 shows the architecture design of the proposed novel rHDFT algorithm. It can be observed that the critical path of the proposed algorithm has  $3T_M + 4T_A$ , where  $T_M$ and  $T_A$  are the time periods of the multiplier and adder, respectively. After using some basic VLSI schemes and re-timing Fig. 1, a higherspeed rHDFT can be drawn, as shown in Fig. 2. Additionally, (i) the word-length of hardware tagged in Fig. 2 and (ii) the coefficients,  $W_N^{Lk}$  and  $W_N^{-Lk}$ , can be easily shared by 2's complement due to the symmetric identities of sine and cosine functions. Since the input sequence is a complex number, the computational complexity of the proposed rHDFT would require a total of three complex multiplications and four complex additions for the calculation of each frequency bin, after the first spectral component has been well calculated.

 Table 1: Computational complexity comparison for various N-point

 SDFT and HDFT algorithms applied in this hopping application

| Metrics                  | 2004 [5]       | 2010 [ <mark>6</mark> ] | 2014 [7]         | 2015 [ <mark>8</mark> ] | Proposed-I<br>and -II  |
|--------------------------|----------------|-------------------------|------------------|-------------------------|------------------------|
| Multiplication           | $I \times M$   | $(I+2)\times M$         | I ×M             | $(I+2)\times M$         | $4 \times M^{(I)}$     |
| Wattipfication           | Linn           | (L+2)~M                 | Linn             | $(L+2)^{-M}$            | 3*M <sup>(II)</sup>    |
| Addition                 | $2L \times M$  | $2L \times M$           | $(L+1) \times M$ | $(L+1) \times M$        | $4 \times M$           |
| Coefficient              | 2              | 4 <i>N</i> +2           | 2 <i>L</i>       | 4N + 2L - 2             | 4                      |
| Cri. Path (a)            | ( <i>L</i> +1, | (L+1, L                 | (I, 2)           | (1, 2)                  | (4, 3) <sup>(1)</sup>  |
| $(T_{\rm A}, T_{\rm M})$ | L)             | +1)                     | (L, 2)           | (L, 3)                  | (2, 1) <sup>(II)</sup> |
| Cat Cycle (b)            |                | $M^{(I)}$               |                  |                         |                        |
| Cpt. Cycle (0)           |                | $M + 2^{(II)}$          |                  |                         |                        |
| Cpt. Time                | (a)×(b)        |                         |                  |                         |                        |

Note: Cri. Path = critical path and Cpt. Cycle = computational cycle.

*Results and discussion:* With regard to the hopping issue of various SDFT and HDFT algorithms, the performance metrics in terms of numbers of multiplications, additions, and coefficients are shown in Table 1. Here, we assume that the transformation length (N) of DFT is 64, the time-domain frame size (M) is 256, the number of time hops (L) is four, and the input signal is a complex number for the kth bin of the single-bin. The results show that an SDFT-based HDFT algorithm such as Duda's [6] is more accurate than Jacobsen and Lyons' [4, 5] algorithms, but Duda's method has greater computational complexity and requires 1536 multiplications and 2048 additions, for the HDFT computation. On the other hand, Park and Ko's [7] HDFT requires a total of 1024 multiplications and 1280 additions. Compared with the latest HDFT approach [8], the number of multiplications and additions for the proposed algorithm is 768 and 1024, representing 50 and 20% reductions compared with Park *et al.*, respectively.

#### **Table 2:** Hardware cost comparison for various N-point SDFT and HDFT architecture designs applied in this hopping application

| Metrics     | 2004 [5] | 2010 [6]  | 2014 [7]   | 2015 [ <mark>8</mark> ] | Proposed-I and -II                    |
|-------------|----------|-----------|------------|-------------------------|---------------------------------------|
| Multiplier  | 4L       | 4L + 8    | 4L         | 4L + 8                  | 16 <sup>(I)</sup> /12 <sup>(II)</sup> |
| Adder       | 6L       | 6L+4      | 4L + 2     | 4L + 6                  | 16 <sup>(I)</sup> /14 <sup>(II)</sup> |
| Register    | N+L      | N + L + 1 | N + 2L - 1 | N + 3L - 1              | N + 2L + 1                            |
| Coefficient | 2        | 3         | 2L         | 2L                      | 4                                     |

Table 3: FPGA implementation results

| Hardware cost     |     | Hardware block (Cyclone IV E EP4CE115F29C7) |                                       |         |       |       |  |  |  |
|-------------------|-----|---------------------------------------------|---------------------------------------|---------|-------|-------|--|--|--|
|                   |     | Sliding unit                                | Sliding unit UVT unit Resonators unit |         | Other | Total |  |  |  |
| Combination       |     | 81                                          | 820                                   | 656     | 77    | 1634  |  |  |  |
| Registers         |     | 7                                           | 201                                   | 201 257 |       | 465   |  |  |  |
| DSP <sup>#1</sup> |     | 0                                           | 48                                    | 32      | 0     | 80    |  |  |  |
| DSP <sup>#2</sup> |     | 0                                           | 0 24 16                               |         |       |       |  |  |  |
| Mam               | RAM | 64×32 bits                                  |                                       |         |       |       |  |  |  |
| wiem              | ROM | 4×24 bits                                   |                                       |         |       |       |  |  |  |

Note: #1: embedded multiplier 9 bit elements and #2:  $18 \times 18$  simple multipliers.

Table 2 compares the hardware cost in terms of multipliers, adders, registers, and coefficient-ROM size. The results demonstrate that the proposed design has the lowest cost. It should be noted that one complex multiplier requires four real multipliers and two real adders for the hardware evaluation. There are only 12 real multipliers, 14 real adders, 73 registers, and four coefficients in the implementation. Compared with Wang *et al.* [8], the number of multipliers can be reduced by 50%. Furthermore, the quality of the Proposed-II is analysed by signal-to-quantization-noise ratio (SQNR), and the results show at least 100.7 dB under 256-time iteration loop. With regard to the hardware realisation, Table 3 shows the detailed FPGA implementation results for the Proposed-II structure, and indicates that the design can be operated at 47.26 MHz. Overall, the proposed design would be more suitable for real-time time–frequency spectrum applications due to its low complexity.

*Conclusion:* In this Letter, a low-cost and low-complexity rHDFT algorithm is developed and designed for time-frequency spectra calculation. The results show that the proposed method has better performance than other approaches. It is thus more powerful for use in future applications focusing on digital signal processing.

© The Institution of Engineering and Technology 2017 Submitted: *1 September 2016* E-first: *25 November 2016* doi: 10.1049/el.2016.3106

W.-H. Juang and C.-H. Luo (Department of Electric Engineering, National Cheng Kung University, Tainan 701, Taiwan)

S.-C. Lai (Department of Computer Science and Information Engineering, Nanhua University, Chiayi 622, Taiwan)

E-mail: shivan0111@nhu.edu.tw

K.-H. Chen (Department of Electric Engineering, National Chiao Tung University, Hsinchu 330, Taiwan)

W.-K. Tsai (Department of Electric Engineering, National Formosa University, Yunlin 632, Taiwan)

#### References

- 1 Oppenheim, A.V., and Schafer, R.W.: 'Discrete-time signal processing' (Prentice-Hall, Upper Saddle River, NJ, 2010, 3rd edn.)
- 2 Goertzel, G.: 'An algorithm for the evaluation of finite trigonometric series', Am. Math. Mon., 1958, 65, pp. 34–35, doi: 10.2307/2310304
- 3 Kober, V.: 'Fast algorithms for the computation of sliding discrete sinusoidal transforms', *Trans. Signal Process.*, 2004, **52**, (6), pp. 1704–1710, doi: 10.1109/TSP.2004.827184
- 4 Jacobsen, E., and Lyons, R.: 'The sliding DFT', *Signal Process. Mag.*, 2003, **20**, (2), pp. 74–80, doi: 10.1109/MSP.2203.1184347
- 5 Jacobsen, E., and Lyons, R.: 'An update to the sliding DFT', Signal Process. Mag., 2004, 21, (1), pp. 110–111, doi: 10.1109/ MSP.2204.1516381
- 6 Duda, K.: 'Accurate, guaranteed stable, sliding discrete Fourier transform', *Signal Process. Mag.*, 2010, 27, (6), pp. 124–127, doi: 10.1109/ MSP.2010.938088
- 7 Park, C.S., and Ko, S.J.: 'The hopping discrete Fourier transform', *Signal Process. Mag.*, 2014, **31**, (2), pp. 135–139, doi: 10.1109/ MSP.2013.2292891
- 8 Wang, Q., Yan, X., and Qin, K.: 'High-precision, permanently stable, modulated hopping discrete Fourier transform', *Signal Process. Lett.*, 2015, **22**, (6), pp. 748–751, doi: 10.1109/LSP.2014.2369518

# 科技部補助專題研究計畫出席國際學術會議心得報告

2016年6月29日

| 報告人姓名 | 賴信志                                                                                                                              | 服務單位<br>及職稱 | 資訊工程學系<br>助理教授 |  |  |
|-------|----------------------------------------------------------------------------------------------------------------------------------|-------------|----------------|--|--|
| 會議時間  | 5/28-6/01                                                                                                                        | 會議地點        | 沖繩/日本          |  |  |
| 會議名稱  | International Conference on Applied System Innovation                                                                            |             |                |  |  |
| 論文名稱  | Low-Cost and Low-Complexity Sliding Discrete Fourier Transform Based on a<br>Compact Recursive Structure for Real Input Sequence |             |                |  |  |

一、參加會議經過

參加這次 2016 International Conference on Applied System Innovation (ICASI, 2016) 國際 研討會主要是與成功大學羅錦興教授實驗室的學生-阮翌翔博士生一同,會議間我們特別遇到 了南台科技大學電機系的陳世中教授、崑山科技大學電腦與通訊系的吳崇民助理教授,並一 同分享討論了本次研討會的歷史與各自研究現況。尤其是陳世中教授鉅細靡遺地為我們介紹 本次 ICASI 的研討會是歷年來舉辦最為盛大的一場,在應用與創新的議題與技術開發上有其 特殊象徵意義。整個研討會的舉行共計有五天,其中有四天有滿滿的 keynote 演講與論文發 表,除此之外,我們亦在沖繩體驗了許多當地民情,以及參觀當地著名的博物館。

ICASI 2016 是由 Fuzhou 大學主辦,並且與 IEEE Tainan section 與虎尾科技大學等大學協辦。大會舉辦本研討會的主要的目的是希望藉由提供這樣的一個整合統一平台來讓相關領域的研究學者能夠針對資訊科技、創新技術研發、資訊通訊、工業化設計、以及電子電機工程等多個面向來進行無論是學術抑或是實際應用的跨領域交流。今年大會總共收到 1063 篇論文,經過嚴密的審查後,獲得接受的論文篇數約莫有 500+篇。因此大會將議程分為 26 場,其中一半的場次為邀請議程;另一半為常態議程。論文接受率為 50%左右,算是國際研討會中,接受率偏低的一個。

大會針對未來發展趨勢安排有四場 Keynote Speech,分別為: (1) Prof. James 針對 Achieving Systemic Change 進行演講; (2) Prof. M. Fatih 則分在科學、工程與設計上的 Imagination and Creativity; 以及國內極為著名的學者 (3) 張俊彥教授 針對 marriage of interdisciplinary research 提供許多面相的建議與剖析。

我們團隊獲得大會接受了兩篇論文並進行口頭報告,該論文議程 5/30 下午同一個場次中 進行口頭發表,我所發表的篇名為 "Low-Cost and Low-Complexity Sliding Discrete Fourier Transform Based on a Compact Recursive Structure for Real Input Sequence";成功大學電機所博 士生阮翌翔同學所發表的篇名為 "A Digitally Self-Calibration Method with Recursive DFT Algorithm for 12-bit SAR ADC Realization"。在這幾天的會議期間,我們也參加一些學術論文 發表的場次,並且交流討論了相關技術的發展與創新,期望能在未來能有更突破性的發展。

會議過程中也遇到了台灣來的學者,如:南台科技大學機械系陳沛仲副教授。陳教授在輔 具應用領域頗具盛名,他的研究領域為智慧型控制、模糊思考與機電整合,對於我與南華資 工系的學生們在指導上能提供跨思維的思考。由於我個人本身為電機背景領域,來到南華資 工系目前將滿三年,在這階段性的時刻能在這個研討會中有機會接觸到多位著名的學者,並 聽聆他們分享的經驗與成果,實在獲益不少,除了有激盪出新的想法外,也看到許多相關領 域的研究,特別是針對 ECG de-noise 的應用,與會間也認識了其他學校的朋友,交換了彼此 的聯繫方式。



### 二、與會心得

本次 IIHMSP 會議的主題共有 13 個常態議程,而我所感興趣的議程分別為:(1) Session E: Electrical and Electronic Engineering;(2) Session G: Green Technology and Architecture Engineering;(3) Session H: Innovation Design and Creative Design 等等。在這三個議程當中,有 三篇論文我特別感興趣,分別的編號是#0361(以小波為基礎的雜訊分臉算法設計)、#0184(針 對市內 LED 燈的熱循環利用系統),以及#0488(以腳踏車為基礎的即時資訊回饋系統設計)。 正巧智能電子與生醫電子生理信號處理等相關的應用,與無線傳輸技術的議題與我在南華大 學資工系目前所推動的相近,提供了我在思考未來學生專題製作設計上,多了一份推動大學 部同學努力來參加國際研討會的想法。而過去的我在研究上主要是針對積體電路設計進行快 速演算法的開發與信號處理的基本應用,在會議中不乏看見許多團隊的成果都已經走向跨領 域團隊的整合與應用,對於目前我遇到的研究方向的拿捏有些許的幫助。本次我所主報的論 文被安排在 5/30 下午的議程場次,主要的研究題目是針對滑動式的 DFT 進行介紹,其英文 論文的摘要如下。最後,也很意外的我們投稿的兩篇論文接獲得了大會的最佳論文獎的肯定, 其獎狀證明如附件。

# 【英文摘要】

This paper proposes a novel sliding discrete Fourier transform (DFT) algorithm and its architecture design to efficiently compute the information of time-frequency spectrum. Since the sliding process is adopted sample by sample, the spectral bin output data rate can be same to the input data rate. Under the conditions of M-sample real input sequence (M=256) and N-point recursive DFT computation (N=64), the proposed method has the following advantages: 1) Proposed-I requires less computational complexity than Krzysztof Duda's method does. The proposed algorithm greatly achieves 80.35% reduction in computation because the number of multiplication only takes 44,864 operations. Additionally, 54.91% of addition operations can be

saved; 2) For computing each frequency bin, the complexity of Proposed-I only requires 4 real additions and 2 real multiplications after the first spectral component has been finally calculated; 3) Proposed-II utilizes three registers and re-timing scheme to effectively shorten and balance the critical path of the proposed design. Moreover, the number of coefficients can be saved by 50% compared to Krzysztof Duda's method. In FPGA implementation, the proposed design can be operated at 43.5 MHz processing rate which can easily meet the requirement of real-time application. Therefore, it would be more suitable for real-time analysis of time-frequency spectrum in the future.



我的論文成果發表(A)



我的論文成果發表(B)

三、建議

建議學校應多鼓勵更多大學部的學生投入國際志工與參與國際研討會活動,提升學校在 國際相關領域的能見度。也鼓勵能與國內各實驗室能在校園內進行跨領域整合之研究。除了 促使學校與學校間更直接緊密的連接,繼而將合作成果以國際研討會發表方式進行成果發表。

四、攜回資料名稱及內容

ICASI 會議程序手冊、會議論文集 USB、高級提袋。



The Taiwanese Institute of Knowledge Innovation

附件二



# 科技部補助計畫衍生研發成果推廣資料表

日期:2017/02/02

|         | 計畫名稱: 符合ANSI S1.11規範之低群延遲、低複雜度與高規格的助聽輔具濾波器組演<br>算法設計與其晶片實現(II) |  |  |  |  |  |  |
|---------|----------------------------------------------------------------|--|--|--|--|--|--|
| 科技部補助計畫 | 計畫主持人: 賴信志                                                     |  |  |  |  |  |  |
|         | 計畫編號: 104-2221-E-343-003- 學門領域: 醫材系統與輔具系統                      |  |  |  |  |  |  |
|         |                                                                |  |  |  |  |  |  |
|         | 無研發成果推廣資料                                                      |  |  |  |  |  |  |
|         |                                                                |  |  |  |  |  |  |
|         |                                                                |  |  |  |  |  |  |
|         |                                                                |  |  |  |  |  |  |
|         |                                                                |  |  |  |  |  |  |
|         |                                                                |  |  |  |  |  |  |
|         |                                                                |  |  |  |  |  |  |

# 104年度專題研究計畫成果彙整表

|                                              |              |           |                          | •   |                    |    | · · · · · ·                                              |
|----------------------------------------------|--------------|-----------|--------------------------|-----|--------------------|----|----------------------------------------------------------|
| <b>計畫主持人:</b> 賴信志                            |              |           | 計畫編號:104-2221-E-343-003- |     |                    |    |                                                          |
| <b>計畫名稱:</b> 符合ANSI S1.11規範之低群延遲<br>晶片實現(II) |              |           | 、低複雜度                    | 與高規 | 現格的助聽輔具濾波器組演算法設計與其 |    |                                                          |
| 成果項目                                         |              |           |                          |     | 量化                 | 單位 | 質化<br>(說明:各成果項目請附佐證資料或細<br>項說明,如期刊名稱、年份、卷期、起<br>訖頁數、證號等) |
|                                              |              | 期刊論文      |                          |     | 0                  | kt |                                                          |
|                                              |              | 研討會論文     |                          | 0   | 扁                  |    |                                                          |
|                                              | 朗仁山八十        | 專書        |                          |     | 0                  | 本  |                                                          |
|                                              | 字何性論义        | 專書論之      | Ż                        |     | 0                  | 章  |                                                          |
|                                              |              | 技術報台      | 告                        |     | 0                  | 篇  |                                                          |
|                                              |              | 其他        |                          |     | 0                  | 篇  |                                                          |
|                                              |              |           | 这四声到                     | 申請中 | 0                  |    |                                                          |
|                                              |              | 專利權       | 贺听等利                     | 已獲得 | 0                  |    |                                                          |
| 國內                                           |              |           | 新型/設計                    | 專利  | 0                  |    |                                                          |
| 1,1                                          |              | 商標權       |                          |     | 0                  | 件  |                                                          |
|                                              | 智慧財產權<br>及成果 | 營業秘密      |                          |     | 0                  |    |                                                          |
|                                              |              | 積體電路電路布局權 |                          | 0   |                    |    |                                                          |
|                                              |              | 著作權       |                          |     | 0                  |    |                                                          |
|                                              |              | 品種權       |                          | 0   |                    |    |                                                          |
|                                              |              | 其他        |                          | 0   |                    |    |                                                          |
|                                              | 计化设施         | 件數        |                          | 0   | 件                  |    |                                                          |
|                                              | <b></b> 拉何移聘 | 收入        |                          | 0   | 千元                 |    |                                                          |
|                                              |              | 期刊論文      |                          | 2   | 答                  |    |                                                          |
|                                              |              | 研討會論文     |                          | 2   | 扁                  |    |                                                          |
|                                              | 舆华州公士        | 專書        |                          |     | 0                  | 本  |                                                          |
|                                              | 子侧住丽义        | 專書論之      | Σ.                       |     | 0                  | 章  |                                                          |
|                                              |              | 技術報台      | 告                        |     | 0                  | 篇  |                                                          |
|                                              |              | 其他        |                          | 0   | 篇                  |    |                                                          |
| 國                                            |              |           | <b>孫</b> 田 重 利           | 申請中 | 0                  |    |                                                          |
| 外                                            |              | 專利權       | 專利權 發明專利 已獲得             | 0   |                    |    |                                                          |
|                                              |              |           | 新型/設計                    | 專利  | 0                  |    |                                                          |
|                                              | 智慧財產權        | 慧財產權 商標權  |                          | 0   | 件                  |    |                                                          |
|                                              | 及成果          | 这果 營業秘密   |                          | 0   |                    |    |                                                          |
|                                              |              | 積體電路      | 各電路布局                    | 權   | 0                  |    |                                                          |
|                                              |              | 著作權       |                          | 0   |                    |    |                                                          |
|                                              |              | 品種權       |                          |     | 0                  |    |                                                          |

|                                                                                            |      | 其他     | 0 |    |  |
|--------------------------------------------------------------------------------------------|------|--------|---|----|--|
|                                                                                            | 技術移轉 | 件數     | 0 | 件  |  |
|                                                                                            |      | 收入     | 0 | 千元 |  |
|                                                                                            |      | 大專生    | 8 |    |  |
|                                                                                            |      | 碩士生    | 0 |    |  |
|                                                                                            | 本國籍  | 博士生    | 1 |    |  |
| 參曲                                                                                         |      | 博士後研究員 | 0 |    |  |
| 丹<br>計                                                                                     |      | 專任助理   | 0 | 人次 |  |
| 畫                                                                                          | 非本國籍 | 大專生    | 0 |    |  |
| 人<br>  カ                                                                                   |      | 碩士生    | 0 |    |  |
|                                                                                            |      | 博士生    | 0 |    |  |
|                                                                                            |      | 博士後研究員 | 0 |    |  |
|                                                                                            |      | 專任助理   | 0 |    |  |
| 其他成果<br>(無法以量化表達之成果如辦理學術活動<br>、獲得獎項、重要國際合作、研究成果國<br>際影響力及其他協助產業技術發展之具體<br>效益事項等,請以文字敘述填列。) |      |        |   |    |  |

# 科技部補助專題研究計畫成果自評表

請就研究內容與原計畫相符程度、達成預期目標情況、研究成果之學術或應用價值(簡要敘述成果所代表之意義、價值、影響或進一步發展之可能性)、是否適 合在學術期刊發表或申請專利、主要發現(簡要敘述成果是否具有政策應用參考 價值及具影響公共利益之重大發現)或其他有關價值等,作一綜合評估。

| 1 | <ul> <li>請就研究內容與原計畫相符程度、達成預期目標情況作一綜合評估</li> <li>■達成目標</li> <li>□未達成目標(請說明,以100字為限)</li> <li>□實驗失敗</li> <li>□因故實驗中斷</li> <li>□其他原因</li> <li>說明:</li> </ul>                                                                                                                                                                        |
|---|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 2 | <ul> <li>A. 研究成果在學術期刊發表或申請專利等情形(請於其他欄註明專利及技轉之證號、合約、申請及洽談等詳細資訊)</li> <li>論文:■已發表 □未發表之文稿 □撰寫中 □無專利:□已獲得 □申請中 ■無<br/>技轉:□已技轉 □洽談中 ■無<br/>其他:(以200字為限)</li> </ul>                                                                                                                                                                   |
| 3 | B. 請依學術成就、技術創新、社會影響等方面,評估研究成果之學術或應用價值<br>(簡要敘述成果所代表之意義、價值、影響或進一步發展之可能性,以500字<br>為限)<br>本計畫近程將以演算法開發為主,同時搭配FPGA與客製化晶片IC實現。關於濾<br>波器組的演算法的部分實現後,提供相關模組給中山醫學大學聽語系的郭明雯<br>教授與羅敦信老師,進行成果效能的評斷與試驗,以利後續臨床實驗與未來合<br>作。目前所研發出的演算法與硬體共計有Quasi-ANSI S1.11 class-2、ANSI<br>S1.11 class-2、以及ANSI S1.11 class-0,相信對於未來助聽輔具之發展將<br>能導引更先進的核心技術。 |
| 4 | <ul> <li>主要發現</li> <li>本研究具有政策應用參考價值:■否 □是,建議提供機關</li> <li>(勾選「是」者,請列舉建議可提供施政參考之業務主管機關)</li> <li>本研究具影響公共利益之重大發現:■否 □是</li> <li>說明:(以150字為限)</li> </ul>                                                                                                                                                                          |