Towards Constructing a Chinese Information Extraction System to Support Innovations in Library Services
Zhang Zhixiong
Information System Department, Library of the Chinese Academy of Sciences (LCAS), Information Technology Section of IFLA, zhangzhx{at}mail.las.ac.cn, Digital Library Research and Development of the Library Society of China, Graduate University of the Chinese Academy of Sciences
Li Sa
Library of Chinese Academy of Sciences, majoring in information extraction, adam.li{at}sap.com
Wu Zhengxin
Information System Department of the Library of the Chinese Academy of Sciences, wuzx{at}mail.las.ac.cn
Lin Ying
Library of Chinese Academy of Sciences, liny{at}lib.bnu.edu.cn
Recognizing the importance of Information Extraction (IE) in supporting innovation in many areas of library services, the authors set out to construct a Chinese information extraction system that can effectively process large volumes of Chinese information resources. They propose a Chinese IE solution that builds on the GATE (General Architecture for Text Engineering) system from the University of Sheffield, developing a Chinese IE plug-in within the GATE framework. The article analyses the framework of the GATE system, describes the Chinese IE solution based on it, and focuses on three key difficulties in implementing a Chinese information extraction system: (1) Chinese tokenization; (2) professional gazetteers; and (3) Chinese named entity recognition. The authors have implemented the system and carried out an experiment in which it successfully extracted information from thousands of science and technology news items. They regard the system as a significant trial that lays a good foundation for future research.
Key Words: information extraction; Chinese language; natural language processing; General Architecture for Text Engineering; GATE; innovation
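
The first two difficulties named in the abstract, tokenization and gazetteer lookup, can be illustrated with a minimal Python sketch. It is not the authors' plug-in and does not use the GATE API. It assumes a simple dictionary-based technique (forward maximum matching), and the `GAZETTEER` entries, type labels, and function names are hypothetical. Chinese text has no spaces between words, so the sketch greedily takes the longest gazetteer entry at each position and falls back to a single character when nothing matches.

```python
# Hypothetical sketch: forward maximum matching over a toy gazetteer.
# Not GATE code; every name and entry below is an illustrative assumption.

GAZETTEER = {
    "中国科学院": "Organization",  # Chinese Academy of Sciences
    "北京": "Location",            # Beijing
    "图书馆": "Facility",          # library
}
MAX_LEN = max(len(entry) for entry in GAZETTEER)


def tokenize(text):
    """Split text by taking the longest gazetteer entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first, shrinking down to one character.
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or candidate in GAZETTEER:
                tokens.append(candidate)
                i += size
                break
    return tokens


def annotate(text):
    """Return (token, type) pairs for the tokens found in the gazetteer."""
    return [(tok, GAZETTEER[tok]) for tok in tokenize(text) if tok in GAZETTEER]
```

Calling `annotate("中国科学院图书馆在北京")` returns the three gazetteer matches with their types and skips the unmatched character `在`. A real system would need much larger domain gazetteers and contextual rules to handle ambiguous segmentations and entities that appear in no list, which is where named entity recognition comes in.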
IFLA Journal, Vol. 33, No. 4, 340-350 (2007)
DOI: 10.1177/0340035207086064
