Text mining, also known as intelligent text analysis, text data mining, unstructured data management, or knowledge discovery in text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge (usually converted to metadata elements) from unstructured text (i.e. free text) stored in electronic form. This can be achieved either through added markup in XML, Atom or RDF formats or through the analysis of common phraseologies that indicate certain relationships.
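The second approach, analysing common phrasings that signal a relationship, can be illustrated with a minimal sketch. The phrase patterns, relation labels and sample sentence below are hypothetical and chosen only for illustration; practical systems rely on far richer linguistic analysis than two regular expressions.

```python
import re

# Hypothetical phrase patterns: each maps a common phrasing to a relation label.
# Entities are approximated as single capitalised words (an assumption).
PATTERNS = [
    (re.compile(r"(?P<a>[A-Z]\w+) is funded by (?P<b>[A-Z]\w+)"), "funded_by"),
    (re.compile(r"(?P<a>[A-Z]\w+) is developing (?P<b>[A-Z]\w+)"), "develops"),
]


def extract_relations(text):
    """Return (subject, relation, object) triples matched by the phrase patterns."""
    triples = []
    for pattern, label in PATTERNS:
        for match in pattern.finditer(text):
            triples.append((match.group("a"), label, match.group("b")))
    return triples


# Illustrative free text; the triples become metadata describing it.
sample = "NaCTeM is funded by JISC. Berkeley is developing BioText."
print(extract_relations(sample))
```

Each returned triple is a small piece of structured metadata derived from free text, which is the kind of output the definition above describes.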
History
Labour-intensive manual text-mining approaches first surfaced in the mid-1980s, but technological advances have enabled the field to progress swiftly during the past decade. Text mining is an interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. As most information (over 80%) is currently stored as text, text mining is believed to have high potential commercial value.
Applications
Recently, text mining has been receiving attention in many areas, most notably in the security, commercial, and academic fields.
Security applications
One of the largest text mining applications in existence is probably the classified ECHELON surveillance system.
Commercial applications
Research and development departments of major companies, including IBM and Microsoft, are researching text mining techniques and developing programs to further automate the mining and analysis processes.
Academic applications
Text mining is important to publishers who hold large databases of information requiring indexing for retrieval. This is particularly true in scientific disciplines, in which highly specific information is often contained within written text. Initiatives have therefore been launched, such as Nature's proposal for an open text mining interface (OTMI) and NIH's common Journal Publishing Document Type Definition (DTD), that would provide semantic cues to machines to answer specific queries contained within text without removing publisher barriers to public access.
Academic institutions have also become involved in the text mining initiative. The National Centre for Text Mining (NaCTeM), a collaborative effort between the Universities of Manchester, Liverpool and Salford, funded by the Joint Information Systems Committee (JISC) and two of the UK Research Councils, aims to provide tools, carry out research and offer advice to the academic community, with an initial focus on text mining in the biological and biomedical sciences. In the United States, the School of Information at the University of California, Berkeley is developing a program called BioText to assist bioscience researchers in text mining and analysis.
Implications
Until recently, websites mostly used text-based lexical searches. Text mining will enable searches which can be directly answered by the semantic web.
External links
- UIMA standard An Open, Industrial-Strength Platform for Unstructured Information Analysis and Search
- Text-Mining.org Good reference of the text mining community
- Kmining List of text mining, data mining and KDD scientific conferences
- Text mining summit 2006
- unstruct.org Latest news about the industry
- Topicalizer - A text analysis tool
- Text analysis tool
- Textengines Quick guide: Text analysis explained
- YALE (Yet Another Learning Environment): free open-source software for knowledge discovery and data mining, including text mining and machine learning. Together with its freely available open-source plugin WordVectorTool, YALE offers a complete software environment for many text mining tasks
- GATE (A General Architecture for Text Engineering): free open-source software for natural language processing (NLP), text mining, and information extraction, that partially uses YALE
- Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering.
- Text mining: Science Digs Deeper.
See also
Artificial intelligence applications