ERPAePRINTS

Formulating representative features with respect to document genre classification

Kim, Dr Yunhyong and Ross, Prof Seamus (2008) Formulating representative features with respect to document genre classification. LDV Forum 23(2).

Abstract

Genre classification (e.g. whether a document is a scientific article or a magazine article) is closely bound to the physical and conceptual structure of a document as well as the level of depth involved in the text. Hence, it provides a means of ranking documents retrieved by search tools according to metrics other than topical similarity. Moreover, the structural information derived from genre classification can be used to locate target information within the text. In previous studies, the detection of genre classes has been attempted by using some normalised frequency of terms or combinations of terms in the document (here, we use "term" to refer to words, phrases, syntactic units, sentences and paragraphs, as well as other patterns derived from deeper linguistic or semantic analysis). These approaches largely neglect how a term is distributed throughout the document. Here, we report the results of automated experiments based on distributive statistics of words in order to present evidence that the distribution pattern of a term is a better indicator of genre class than its frequency.
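
The contrast the abstract draws between term frequency and term distribution can be illustrated with a minimal sketch. This is not the authors' method: the function names, the choice of mean and standard deviation of relative positions as the "distributive statistics", and the two toy documents are all assumptions made purely for illustration.

```python
from statistics import mean, pstdev


def normalised_frequency(tokens, term):
    """Occurrences of `term` divided by document length (0.0 for an empty document)."""
    return tokens.count(term) / len(tokens) if tokens else 0.0


def relative_positions(tokens, term):
    """Positions of `term` scaled to the range [0, 1) by document length."""
    n = len(tokens)
    return [i / n for i, tok in enumerate(tokens) if tok == term]


def distribution_stats(tokens, term):
    """Hypothetical distribution feature: (mean, population std dev) of the
    relative positions of `term`, or None if the term does not occur."""
    positions = relative_positions(tokens, term)
    if not positions:
        return None
    return mean(positions), pstdev(positions)


# Two toy documents (invented for illustration) with the same length and the
# same number of occurrences of "method", placed differently.
clustered = "method method method results discussion summary notes end".split()
spread = "method results method discussion summary method notes end".split()

for name, doc in (("clustered", clustered), ("spread", spread)):
    freq = normalised_frequency(doc, "method")
    centre, dispersion = distribution_stats(doc, "method")
    print(f"{name}: frequency={freq:.3f} mean_pos={centre:.3f} std_pos={dispersion:.3f}")
```

Both toy documents give the same normalised frequency for "method", so a frequency-only feature cannot tell them apart, while the positional statistics differ. Whether features of this kind separate real genre classes is the question the paper's experiments address; the sketch only shows what kind of information a frequency count discards.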

Item Type: Journal (On-line/Unpaginated)
Keywords: document classification · genre · document representation · word distribution
Subjects: M Resource Discovery
L Digital Repository, Digital Archive and Digital Library Models > LA Ingest
E Data Description, Documentation and Standards > EA Metadata
Document Language: English
ID Code: 154
Deposited By: Kim, Dr Yunhyong
Deposited On: 19 November 2008