A Hybrid Approach to Vietnamese Word Segmentation Using Part of Speech Tags

Pham, Dang Duc; Tran, Giang Binh; Pham, Son Bao

doi:10.1109/kse.2009.44

Cited by 31 publications

(13 citation statements)

References 7 publications

“…The transition graph will show the probability among syllables to form the words in a specific text. Nguyen et al, 2003;Pham et al, 2009 used this method to segment Vietnamese word [3,4].…”

Section: B Transition Graph Methods (mentioning)

confidence: 99%

Dynamic Programming Method Applied in Vietnamese Word Segmentation Based on Mutual Information among Syllables

Uyen, Sang

2014

ijarai

Abstract-Vietnamese word segmentation is an important step in Vietnamese natural language processing such as text categorization, text summary, and automated machine translation. The problem with Vietnamese word segmentation is complicated because Vietnamese words are not always separated by a space. One word can include one or more syllables depending on the context. This paper proposes a method for Vietnamese word segmentation based on the mutual information among the syllables combined with dynamic programming. With this method, we can achieve an accuracy rate of about 90% with a raw text corpus.
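
The combination named in this abstract, mutual information among syllables plus dynamic programming, can be illustrated with a minimal sketch. This is not the authors' formulation: the scoring rule (sum of pointwise mutual information minus a `threshold` over the links inside a word), the `max_len` limit, the underscore joining, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_counts(sentences):
    """Count syllables and adjacent syllable pairs in a whitespace-split corpus."""
    uni, bi = Counter(), Counter()
    for sentence in sentences:
        syllables = sentence.lower().split()
        uni.update(syllables)
        bi.update(zip(syllables, syllables[1:]))
    return uni, bi


def pmi(a, b, uni, bi):
    """Pointwise mutual information of the adjacent pair (a, b); -inf if unseen."""
    if bi[(a, b)] == 0:
        return float("-inf")
    n_uni, n_bi = sum(uni.values()), sum(bi.values())
    p_ab = bi[(a, b)] / n_bi
    return math.log(p_ab / ((uni[a] / n_uni) * (uni[b] / n_uni)))


def segment(syllables, uni, bi, max_len=3, threshold=0.0):
    """Dynamic programming: best[i] is the best score for the first i syllables."""

    def word_score(chunk):
        # Assumed scoring: single syllables score 0; each internal link of a
        # longer word contributes (PMI - threshold), so only strong links pay off.
        return sum(pmi(a, b, uni, bi) - threshold for a, b in zip(chunk, chunk[1:]))

    n = len(syllables)
    best = [0.0] + [float("-inf")] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            score = best[j] + word_score(syllables[j:i])
            if score > best[i]:
                best[i], back[i] = score, j

    # Follow the back-pointers from the end to recover the chosen words.
    words, i = [], n
    while i > 0:
        j = back[i]
        words.append("_".join(syllables[j:i]))
        i = j
    return words[::-1]


# Hypothetical toy corpus, far too small for real use.
toy_corpus = [
    "học sinh giỏi",
    "học sinh rất",
    "học sinh chăm",
    "giỏi rất chăm",
    "chăm rất giỏi",
]
uni, bi = train_counts(toy_corpus)
result = segment("học sinh giỏi".split(), uni, bi, threshold=1.5)
print(result)
```

With these toy counts the pair "học sinh" has a much higher mutual information than "sinh giỏi", so only the first pair is joined. On a real corpus the threshold and maximum word length would need tuning.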

“…As the case satisfies the conditions of the rules at nodes (3), (5) and (40), it is passed to node (42), using except edges. Since the case does not satisfy the conditions of the rules at nodes (42), (43) and (45), we have the evaluation path (0)-(1)-(2)-(3)-(5)-(40)-(42)-(43)-(45) with the last fired node (40). Given another case of "In which projects is enrico motta working on", it satisfies the conditions of the rules at nodes (0), (1) and (2); as node (2) has no except child node, we have the evaluation path (0)-(1)-(2) and the last fired node (2).…”

Section: Single Classification Ripple Down Rules (mentioning)

confidence: 99%

Ripple Down Rules for question answering

Nguyen

Pham

2017

Self Cite

Recent years have witnessed a new trend of building ontology-based question answering systems. These systems use semantic web information to produce more precise answers to users' queries. However, these systems are mostly designed for English. In this paper, we introduce an ontology-based question answering system named KbQAS which, to the best of our knowledge, is the first one made for Vietnamese. KbQAS employs our question analysis approach that systematically constructs a knowledge base of grammar rules to convert each input question into an intermediate representation element. KbQAS then takes the intermediate representation element with respect to a target ontology and applies concept-matching techniques to return an answer. On a wide range of Vietnamese questions, experimental results show that the performance of KbQAS is promising with accuracies of 84.1% and 82.4% for analyzing input questions and retrieving output answers, respectively. Furthermore, our question analysis approach can easily be applied to new domains and new languages, thus saving time and human effort.
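
The evaluation path described in the citation statement for this paper (a case moves along an except edge when a rule's condition holds, and the conclusion comes from the last fired node) can be sketched as follows. The `Node` class, the field names, and the four-node toy tree are hypothetical; they do not reproduce the numbered tree from the cited paper.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    node_id: int
    condition: Callable[[dict], bool]
    conclusion: str
    except_child: Optional["Node"] = None  # followed when the rule fires
    if_not_child: Optional["Node"] = None  # followed when the rule does not fire


def evaluate(root, case):
    """Return the evaluation path (node ids) and the last fired node."""
    path, last_fired = [], None
    node = root
    while node is not None:
        path.append(node.node_id)
        if node.condition(case):
            last_fired = node
            node = node.except_child
        else:
            node = node.if_not_child
    return path, last_fired


# Hypothetical toy tree: node 0 is a default rule that always fires.
root = Node(0, lambda c: True, "default")
n1 = Node(1, lambda c: "which" in c["text"], "wh-question")
n2 = Node(2, lambda c: "projects" in c["text"], "project-question")
n3 = Node(3, lambda c: "who" in c["text"], "person-question")
root.except_child = n1
n1.except_child = n2
n1.if_not_child = n3

path, fired = evaluate(root, {"text": "in which projects is enrico motta working on"})
print(path, fired.conclusion)
```

Because node 2 has no except child in this toy tree, evaluation stops there and node 2 is the last fired node, which mirrors the second case in the quoted statement.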

“…Depending on the matching strategy, there are different methods, such as longest matching, shortest matching, overlap matching, and maximum matching (Dinh et al, 2001), (Pham et al, 2009). The accuracy of dictionary-based methods depends heavily on the size of the dictionary that has been built.…”

Section: A Dictionary-Based Approach (unclassified)
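
Of the matching strategies listed in the statement above, longest matching is the simplest to illustrate. The following is a minimal greedy sketch, not the implementation from any of the cited papers; the function name, the `max_len` limit, the underscore joining, and the three-entry dictionary are assumptions.

```python
def longest_matching(syllables, dictionary, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary entry
    that starts at the current position, else fall back to one syllable."""
    words, i = [], 0
    while i < len(syllables):
        for size in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + size])
            if size == 1 or candidate in dictionary:
                words.append(candidate.replace(" ", "_"))
                i += size
                break
    return words


# Hypothetical toy dictionary.
toy_dictionary = {"học sinh", "sinh học", "học"}
segmented = longest_matching("học sinh học sinh học".split(), toy_dictionary)
print(segmented)
```

On this input the greedy pass commits to "học sinh" twice and never considers "sinh học", which shows both why overlap ambiguity is a known weakness of the strategy and why the result depends so directly on what the dictionary contains.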

Sự Ảnh Hưởng Của Phương Pháp Tách Từ Trong Bài Toán Phân Lớp Văn Bản Tiếng Việt (The Influence of Word Segmentation Methods on Vietnamese Text Classification)

Khang, Thư, Phi et al.

2017

Fair - Nghiên Cứu Cơ Bản Và Ứng Dụng Công Nghệ Thông Tin - 2016

Keywords: word segmentation, Vietnamese word segmentation methods, natural language processing, text classification.

I. INTRODUCTION

With the rapid development of information technology, more and more online information is appearing in text form. This information comes from digital libraries, email, web pages, and search and information retrieval systems. Discovering the knowledge hidden in text collections is necessary for managing and effectively exploiting this enormous source of textual information. Text categorization is one of the main techniques for processing and organizing text data. It is used to label news items automatically, to organize email or files, and to detect spam. The text classification problem can be defined briefly as follows: assign each document a label from a set of predefined topics, based on the content of the document. Text classification is usually based on semantic models or machine learning. However, in an interview conducted by M. Lucas (Mappa Mundi magazine) in 1999, M. Hearst argued that the semantic approach is very difficult and complex. Machine learning approaches are therefore simpler and give good results in practice. Most text classification methods are based on statistical word models and machine learning classification algorithms (Dumais et al., 1998), (Sebastiani, 1999), (Manning et al., 2008).

The first step in text classification is to transform a document from a character string into a form suitable for machine learning algorithms. Text data is generally unstructured (documents have different lengths), whereas most algorithms require structured training data (for example, feature vectors of equal length). Research in information retrieval has shown that the order of words in a document plays only a minor role in most text analysis and processing tasks (Joachims, 1999).

For this reason the bag-of-words model (Salton et al., 1975) is a popular model for representing text data. In this model, each distinct word in a document is a feature, and its frequency in the document is the value of that feature. Feature extraction consists of word segmentation and counting the occurrences of each word in the document. The document is thus represented as a frequency vector.

The next step is to train a machine learning model on this data table. Commonly used models include k-NN (Fix & Hodges, 1952), naive Bayes (Good, 1965), decision trees (Quinlan, 1993), (Breiman et al., 1984), support vector machines (Vapnik, 1995), and ensemble methods including Boosting (Freund & Schapire, 1995), (Breiman, 1998) and random forests (Breiman, 2001). Earlier machine learning studies (Phạm et al., 2006), (Phạm et al., 2008), (Đỗ, 2012), (Đỗ & Phạm, 2013) proposed algorithms based on ensembles, support vector machines, and naive Bayes that can efficiently classify high-dimensional datasets, such as text represented with the bag-of-words model. For the ...
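
The feature-extraction step described in this abstract (segment each document into words, then count occurrences to obtain a frequency vector) can be sketched in a few lines. The function name and the two pre-segmented toy documents are assumptions; the sketch takes word segmentation as already done and only builds the vectors.

```python
from collections import Counter


def bag_of_words(segmented_docs):
    """Build a shared vocabulary and one term-frequency vector per document."""
    vocab = sorted({word for doc in segmented_docs for word in doc})
    vectors = []
    for doc in segmented_docs:
        counts = Counter(doc)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


# Hypothetical documents, already segmented into words.
docs = [
    ["học_sinh", "đi", "học"],
    ["học_sinh", "học_sinh"],
]
vocab, vectors = bag_of_words(docs)
for vector in vectors:
    print(dict(zip(vocab, vector)))
```

Every vector has the same length as the vocabulary, which is the fixed-length structure the abstract says most learning algorithms require. A different segmentation of the same text changes the vocabulary and therefore the features.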
