Abstract: | Research on Chinese full-text information retrieval started later than research on Western-language retrieval because of the nature and characteristics of the Chinese language. Its fundamental difficulty today is the lack of an experimentally valid research environment, in particular an objective test collection and a standard for evaluating retrieval effectiveness. This thesis evaluates two systems on the same corpus: the Chinese Text Processor (CTP), a full-text scanning system developed by Academia Sinica in 1996, and the Cluster Indexing Model (CIM) developed by Huang Yun-Long in 1997.
Given the current state of Chinese information retrieval research, the thesis aims to make three contributions:
1. A comparative study of the retrieval effectiveness of different models on the same corpus.
2. A more mature Cluster Indexing Model for Chinese retrieval systems, as a foundation for applied research.
3. A workable experimental environment for Chinese information retrieval research (experimental platform, evaluation standard, and procedure design).
The experiments use medical news articles from the 1993 Children’s Daily News (502 documents) and 21 queries. Under the experimental and research controls, the effectiveness results are as follows:
1. Full-text scanning model (CTP): average recall is 99.02% and average precision is 17.72%.
2. CIM with automatic term selection (term segmentation), index dimension 100, and similarity threshold 0.3:
(1) CIM-IDF (IDF weighting): recall is 80.73% and precision is 45.09%.
(2) CIM-TF (TF weighting): recall is 65.97% and precision is 43.52%.
3. CIM with manual term selection, index dimension 100, and similarity threshold 0.3:
(1) CIM-IDF: recall is 82.81% and precision is 47.11%.
(2) CIM-TF: recall is 64.81% and precision is 42.72%.
4. These results lead to the following findings:
(1) With both automatic and manual term selection, the retrieval effectiveness of CIM-IDF is better than that of the full-text scanning model (CTP).
(2) With both automatic and manual term selection, CIM-IDF outperforms CIM-TF: its recall is significantly higher, while precision does not differ significantly.
(3) As the number of index dimensions increases (above 80 for CIM-IDF), retrieval effectiveness with automatic and with manual term selection does not differ significantly. In other words, terms produced by automatic word segmentation can substitute for manually selected index terms.
Information retrieval has been studied for a long time. Algorithms and theories keep being renewed and systems keep improving, yet no system fully satisfies its users. Future systems must support several ways of retrieving data, and should even be able to apply different models during relevance feedback.
Moreover, although Chinese information retrieval research covers many topics, very few studies evaluate the effectiveness of retrieval systems built on different models. If a standard evaluation environment could be constructed (a large document collection, queries, relevance judgments, and a standard evaluation method), experiments conducted within it would help develop system mechanisms and improve retrieval effectiveness. |
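The recall and precision figures reported above can be illustrated with a minimal sketch. The abstract does not say how the averages were computed, so this assumes set-based recall and precision per query, averaged with equal weight over all queries. The function names, query IDs, and document IDs are hypothetical and do not come from the thesis.

```python
from typing import Dict, Set, Tuple


def recall_precision(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Set-based recall and precision for a single query."""
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


def average_recall_precision(
    runs: Dict[str, Set[str]], judgments: Dict[str, Set[str]]
) -> Tuple[float, float]:
    """Average recall and precision over all judged queries (assumed equal weight)."""
    pairs = [
        recall_precision(runs.get(query, set()), relevant)
        for query, relevant in judgments.items()
    ]
    n = len(pairs)
    avg_recall = sum(r for r, _ in pairs) / n
    avg_precision = sum(p for _, p in pairs) / n
    return avg_recall, avg_precision


# Toy data (hypothetical): every relevant document is found, but half of
# what is returned is irrelevant.
judgments = {"q1": {"d1", "d2"}, "q2": {"d3"}}
runs = {"q1": {"d1", "d2", "d4", "d5"}, "q2": {"d3", "d6"}}

avg_recall, avg_precision = average_recall_precision(runs, judgments)
print(f"average recall = {avg_recall:.2%}, average precision = {avg_precision:.2%}")
```

The toy run returns every relevant document along with as many irrelevant ones, so recall is high and precision is low. This is the same kind of trade-off as in the full-text scanning result, which is why the two measures are reported side by side.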