
Automating Metadata Extraction: Genre Classification

Kim, Dr Yunhyong and Ross, Prof Seamus (2006) Automating Metadata Extraction: Genre Classification. In Proceedings UK e-Science All Hands Meeting 2006: Achievements, Challenges, and New Opportunities, Nottingham.


A problem that frequently arises in the management and integration of scientific data is the lack of context and semantics that would link data encoded in disparate ways. To bridge this gap, it often helps to mine scientific texts to aid the understanding of the database. Mining relevant text can, in turn, be significantly aided by the availability of descriptive and semantic metadata. The Digital Curation Centre (DCC) has undertaken research to automate the extraction of metadata from documents in PDF ([22]). Documents may include scientific journal papers, lab notes or even emails. We suggest genre classification as a first step toward automating metadata extraction. The classification method will be built by examining documents from five directions: as an object of specific visual format, as a layout of strings with characteristic grammar, as an object with stylometric signatures, as an object with meaning and purpose, and as an object linked to previously classified objects and external sources. Some results of experiments relating to the first two directions are described here; they are meant to indicate the promise of this multi-faceted approach.

Item Type: Conference Paper
Subjects: E Data Description, Documentation and Standards > EE Description
C Strategies and Procedures > CG Harvesting
P Curation Issues
E Data Description, Documentation and Standards > EA Metadata
E Data Description, Documentation and Standards
Document Language: English
ID Code: 111
Deposited By: Ross, Professor Seamus
Deposited On: 18 October 2006