RISS Search - Thesis Details

Korean Abstract

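In this study, we propose a chronological analysis system for intangible cultural heritage documents to support the use of intangible heritage. The system consists of three main components: a document feature extraction method and a document clustering method, both newly proposed in this study, and a previously published CNN (Convolutional Neural Network)-based text classification method. The research proceeded as follows.
First, we propose a document feature extraction method that applies sentence signaling incorporating WordNet similarity together with the discrete Fourier transform, and we apply it to plagiarized-sentence detection to validate its performance. The experiments use the official data set (2013-corpus) provided by PAN, a well-known workshop in the plagiarism detection field. The method ranked fourth among eleven previously published plagiarized-sentence detection methods. In particular, it improved performance by about 5.1% to 27.4% over plagiarized-sentence detection methods based on the vector space model. These results show that the feature extraction method based on the discrete Fourier transform can be extended from sentences to documents.
Second, for document clustering, we propose a clustering ensemble method that combines the search capabilities of genetic algorithms (GA) and particle swarm optimization (PSO). For the experiments, the Reuters-21578 and 20-Newsgroups document collections were divided into four sub-datasets, and performance was compared with the F-measure for three cases (best case, worst case, and average). The proposed method outperformed the other document clustering methods in all three cases. Moreover, applying the discrete Fourier transform-based document feature extraction method to the GA and PSO clustering ensemble improved performance on the four data sets by about 2.27%, 5.41%, 2.07%, and 4.94%, respectively. The improvement was especially large on data sets with more index terms and on data sets made up of mutually related terms.
Finally, using the proposed methods and the previously published CNN text classification method, we propose a chronological analysis system for intangible cultural heritage documents. This study focuses on classifying documents related to Korean history into three periods: the Three Kingdoms period, Goryeo, and Joseon. Training data were built from documents collected from institutions that provide period-specific documents (Korean History LOD, the Encyclopedia of Korean Folk Culture, and the online Encyclopedia of Korean Culture), and a CNN was used to build the period classification model, which achieved a classification rate of 86%. In addition, intangible heritage documents from ICHPEDIA, the most actively operated online intangible heritage inventory system in Korea and abroad, were classified by period with the CNN, with a focus on folk tales and legends. The proposed GA and PSO clustering ensemble method was then used to analyze topic flow by period, which made it possible to identify chronological trends for people and for nature/animals/others.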

Multilingual Abstract

In this study, we propose a chronological analysis system for intangible cultural heritage text documents. The system consists of three methods: a text document feature extraction method and a text document clustering method, both newly proposed in this study, and an existing CNN (Convolutional Neural Network) method for text classification. The detailed research steps are as follows.
First, we propose a text document feature extraction method that applies the Discrete Fourier Transform (DFT) to sentence signals incorporating a WordNet similarity measure, and we apply it to the detection of plagiarized sentences. Our experiments use the 2013-Corpus data set provided by PAN, one of the well-known workshops on text plagiarism. Our method ranks fourth among eleven published plagiarism detection methods. In particular, it improves plagiarized-sentence detection performance by 5.1% to 27.4% compared with detection methods based on the vector space model. These results show that this feature extraction method for sentences can be extended to documents.
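To make the DFT-based feature extraction idea concrete, the following is a minimal Python sketch, not the thesis implementation. It assumes a hypothetical `sentence_signal` mapping that assigns each token an integer ID as a stand-in for the WordNet-similarity-based signaling, which the abstract does not specify. The helper names `dft_features` and `cosine` and the choice of four coefficients are likewise illustrative assumptions.

```python
import cmath
import math


def sentence_signal(sentence, vocabulary):
    """Turn a sentence into a numeric signal, one value per token.

    ASSUMPTION: each new token simply gets the next integer ID. The thesis
    uses a WordNet-similarity-based signaling that is not reproduced here.
    """
    return [float(vocabulary.setdefault(tok, len(vocabulary) + 1))
            for tok in sentence.lower().split()]


def dft_features(signal, n_coeffs=4):
    """Return the normalized magnitudes of the first n_coeffs DFT coefficients.

    Coefficients beyond the signal length are filled with 0.0 so that every
    sentence yields a vector of the same length.
    """
    n = len(signal)
    feats = []
    for k in range(n_coeffs):
        if k < n:
            coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                        for t, x in enumerate(signal))
            feats.append(abs(coeff) / n)
        else:
            feats.append(0.0)
    return feats


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy usage with made-up sentences sharing one vocabulary.
vocab = {}
source = dft_features(sentence_signal("the old king ruled the land", vocab))
suspect = dft_features(sentence_signal("the old king governed the land", vocab))
similarity = cosine(source, suspect)
```

Keeping a fixed number of low-frequency magnitudes gives vectors of equal length regardless of sentence length, which is what makes a direct similarity comparison between two sentences possible in this sketch.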
Second, we propose an ensemble clustering method for text documents that combines GA (Genetic Algorithms) and PSO (Particle Swarm Optimization) to exploit the search capabilities of both optimization algorithms. To test its effectiveness, we conduct experiments on four subsets of the standard Reuters-21578 and 20 Newsgroups datasets, and we compare performance using the F-measure in three cases (best case, worst case, and average). The experimental results show that the proposed method outperforms other text document clustering algorithms in all three cases. Moreover, when the DFT-based text document feature extraction is applied to the GA and PSO ensemble clustering method, performance on the four datasets improves by about 2.27%, 5.41%, 2.07%, and 4.94%, respectively. The improvement is particularly high on datasets that have more index terms and on datasets whose terms are closely related to each other.
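The abstract does not state which F-measure variant is used for clustering evaluation. As an illustration only, the sketch below computes one common class-weighted clustering F-measure, in which each true class is matched with its best-scoring cluster. The function name `clustering_f_measure` and the toy labels are assumptions.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_ids):
    """Class-weighted F-measure of a clustering against reference labels.

    For every true class, take the best F1 score over all clusters, then
    average these best scores weighted by class size.
    ASSUMPTION: this is one widely used definition, not necessarily the
    exact variant used in the thesis.
    """
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_ids)
    overlap = Counter(zip(true_labels, cluster_ids))

    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap[(cls, clu)]
            if n_both == 0:
                continue
            precision = n_both / n_clu
            recall = n_both / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_cls / n) * best
    return total


# Toy usage with made-up labels: a perfect two-cluster split.
score = clustering_f_measure(["a", "a", "b", "b"], [0, 0, 1, 1])
```

Running a stochastic clustering method several times and taking the maximum, minimum, and mean of this score is one way the best-case, worst-case, and average comparison could be obtained.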
Finally, we propose a chronological analysis system for intangible cultural heritage text documents that uses the proposed feature extraction and clustering methods together with CNN text classification. We focus on classifying Korean historical documents into three eras: the Three Kingdoms period (Silla, Goguryeo, Baekje), the Goryeo dynasty, and the Joseon dynasty. The training data set is collected from the Korean history LOD, the Korean folk culture encyclopedia, and the Korean culture encyclopedia, and we create the chronological classification model through the CNN document classification method. The resulting model achieves 86% classification accuracy. In addition, intangible cultural heritage text documents on Korean folk tales and legends from ICHPEDIA, the most active online inventory system for intangible cultural heritage in Korea and abroad, are classified by era using the CNN. We then analyze the chronologically classified documents with the proposed GA and PSO ensemble clustering method. As a result, we can identify chronological topic trends related to people and to nature/animals/others.
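The final analysis step, tallying cluster topics within each classified era, might look roughly like the sketch below. The function `topic_shares_by_era` and the sample records are hypothetical and are not data from the study. In the described pipeline, the era label would come from the CNN classifier and the topic label from the GA and PSO ensemble clustering.

```python
from collections import Counter, defaultdict


def topic_shares_by_era(records):
    """Compute, for each era, the fraction of documents in each topic.

    `records` is an iterable of (era, topic) pairs, one pair per document.
    """
    counts = defaultdict(Counter)
    for era, topic in records:
        counts[era][topic] += 1

    shares = {}
    for era, topic_counts in counts.items():
        total = sum(topic_counts.values())
        shares[era] = {topic: c / total for topic, c in topic_counts.items()}
    return shares


# Made-up records for illustration only (NOT results from the thesis).
records = [
    ("Three Kingdoms", "people"),
    ("Three Kingdoms", "nature/animal/others"),
    ("Goryeo", "people"),
    ("Joseon", "people"),
    ("Joseon", "nature/animal/others"),
    ("Joseon", "nature/animal/others"),
]
shares = topic_shares_by_era(records)
```

Comparing the share of a topic across eras is a simple way to read off a chronological trend once every document carries both labels.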

Table of Contents