Thursday, 19 April 2012

Unstructured Data Mining – A huge challenge or a huge opportunity?

Gartner estimates that unstructured data constitutes 80% of all enterprise data. A large proportion of it comprises chat transcripts, emails and other informal and semi-formal internal and external communications, such as MS Word and PDF documents. Such text is usually meant for human consumption. However, with huge amounts of it now present both online and within the enterprise, it has become important to mine it using computers.

By definition, unstructured data (or unstructured information) refers to information that does not have a pre-defined data model, does not fit well into relational tables, or both. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well.
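As a rough illustration of how dates and numbers can hide inside text-heavy data, here is a minimal Python sketch. The sample sentence, the two regular expressions and the function name `extract_facts` are all assumptions made up for this example; real correspondence would need far more robust patterns.

```python
import re

# Hypothetical sample text, invented for this illustration.
NOTE = "Order 4512 shipped on 19 April 2012; the client paid 250.75 USD on 2012-04-21."

# Two date shapes only: ISO dates (2012-04-21) and "19 April 2012".
DATE_PATTERN = re.compile(
    r"\b\d{4}-\d{2}-\d{2}\b"
    r"|\b\d{1,2} (?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4}\b"
)
# Integers and simple decimals.
NUMBER_PATTERN = re.compile(r"\b\d+(?:\.\d+)?\b")


def extract_facts(text):
    """Pull dates and plain numbers out of free-form text."""
    dates = DATE_PATTERN.findall(text)
    # Blank out the dates first so their digits are not counted again as numbers.
    remainder = DATE_PATTERN.sub(" ", text)
    numbers = NUMBER_PATTERN.findall(remainder)
    return {"dates": dates, "numbers": numbers}


print(extract_facts(NOTE))
```

The result is a small structured record (a dictionary) derived from a sentence that no relational table would hold comfortably as-is.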

A pressing problem right now is unstructured information. In businesses and offices, file stores and databases are strewn with sensitive data that is uncategorized, unclassified, and unprotected. With all the data exchanged within and between offices, collecting and organizing this data is proving to be a challenge.

Managing unstructured information is vital for any business, as this uncategorized data may prove important in the decision-making process. Much investment is going into searching and systematizing data in networks, because a host of vital information may be found in these free-form texts (both in soft and hard form), such as the following:

•  Client Responses - This information may be buried within countless emails and other correspondence.
•  Market Rivals - New products and services launched by the competition may be analyzed through uncategorized research documents.
•  Market Segments - Feedback from consumers and customers may be derived from call transcripts and user comments.

For a company, the successful classification and management of unstructured information may lead to more profitable decisions and business opportunities.

Dealing with unstructured data
Data mining, text analytics and noisy-text analytics are different methods used to find patterns in, or otherwise “interpret”, this information. Common techniques for structuring text usually involve manual tagging with metadata or part-of-speech tagging for further text mining-based structuring. There are several commercial solutions that help one analyze and understand unstructured data for business applications. Apache UIMA, on the other hand, is an open source option.

UIMA (Unstructured Information Management Architecture) provides a common framework for processing this information to extract meaning and create structured data about the information.

UIMA analyzes large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIM application might ingest plain text and identify entities, such as persons, places and organizations, or relations, such as works-for or located-at.
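To make the idea of entities and relations concrete, here is a toy Python sketch. It does not use UIMA or any of its APIs; the name lists, the sample sentence and the functions `find_entities` and `find_relations` are hypothetical, and real systems rely on far richer dictionaries and statistical models rather than exact string matching.

```python
import re

# Hypothetical name lists, invented for this illustration.
PERSONS = ("Alice Smith", "Bob Jones")
ORGANIZATIONS = ("Acme Corp",)
PLACES = ("Pune", "London")

TEXT = "Alice Smith works for Acme Corp. Acme Corp is located in London."


def find_entities(text):
    """Return (type, name, start offset) for every listed name found, in text order."""
    hits = []
    for label, names in (("Person", PERSONS),
                         ("Organization", ORGANIZATIONS),
                         ("Place", PLACES)):
        for name in names:
            for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                hits.append((label, name, match.start()))
    return sorted(hits, key=lambda hit: hit[2])


def find_relations(text):
    """Detect two relation types using fixed connecting phrases."""
    relations = []
    for person in PERSONS:
        for org in ORGANIZATIONS:
            if f"{person} works for {org}" in text:
                relations.append(("works-for", person, org))
    for org in ORGANIZATIONS:
        for place in PLACES:
            if f"{org} is located in {place}" in text:
                relations.append(("located-at", org, place))
    return relations


for entity in find_entities(TEXT):
    print(entity)
for relation in find_relations(TEXT):
    print(relation)
```

The point is only the shape of the output: typed spans of text plus typed links between them, which is structured data that can be stored and queried.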

UIMA enables applications to be decomposed into components, for example "language identification" => "language specific segmentation" => "sentence boundary detection" => "entity detection (person/place names etc.)". Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The framework additionally provides capabilities to wrap components as network services, and can scale to very large volumes by replicating processing pipelines over a cluster of networked nodes.
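The component chain described above can be sketched in plain Python. This is not the UIMA API: the `Document` class, the three step functions and `run_pipeline` are hypothetical names, the heuristics are deliberately naive, and the "language specific segmentation" step is left out to keep the sketch short. It only shows the general pattern of small components that each add annotations to a shared document object.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    """Shared object that every step reads from and annotates."""
    text: str
    annotations: dict = field(default_factory=dict)


def identify_language(doc):
    # Toy heuristic: pick the language whose stopwords appear most often.
    words = set(re.findall(r"[a-z]+", doc.text.lower()))
    stopwords = {"en": {"the", "is", "in", "and"},
                 "de": {"der", "ist", "und", "die"}}
    doc.annotations["language"] = max(
        stopwords, key=lambda lang: len(words & stopwords[lang]))
    return doc


def split_sentences(doc):
    # Split after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", doc.text.strip())
    doc.annotations["sentences"] = [part for part in parts if part]
    return doc


def detect_entities(doc):
    # Toy rule: two or more consecutive capitalized words form a name.
    pattern = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")
    doc.annotations["entities"] = [
        name
        for sentence in doc.annotations["sentences"]
        for name in pattern.findall(sentence)
    ]
    return doc


PIPELINE = [identify_language, split_sentences, detect_entities]


def run_pipeline(text, steps=PIPELINE):
    doc = Document(text)
    for step in steps:
        doc = step(doc)
    return doc


result = run_pipeline(
    "Alice Smith is in London. She works for Acme Corp and the team is small.")
print(result.annotations)
```

Because each step only depends on the annotations left by earlier steps, a step can be swapped for a better implementation without touching the rest of the chain.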

An additional toolset is also provided to further extend UIMA’s capabilities for extracting meaningful information from unstructured data. One such tool is CFE (Configurable Feature Extractor), which enables feature extraction in a very generalized way. This is done using rules expressed in FESL (Feature Extraction Specification Language) in XML form. FESL's rule semantics allow precise identification of the information to be extracted by specifying multi-parameter criteria.

In my opinion, unstructured data management will be of huge help to government and semi-government agencies, even more than to the IT industry. These agencies hold millions of unstructured or semi-structured documents, lying on shelves in physical form or on computers in soft form, waiting to be explored, read again and referred to.

There are powerful open source search platforms, such as Apache Solr, that claim to integrate seamlessly with UIMA. This further strengthens the case for combining these open source technologies to solve this enterprise problem. Having said that, is it really a huge problem, a challenge to face, or an opportunity that an IT service company should grab?



  1. Thank you for the great information,
    I see that you gave the simplest definition of
    data mining, which is good.
    We want more of these objective videos on YouTube. Thank you.

    Some Good information..

  2. Hi TaskTrek
    I am thinking of putting together a prototype. If that materializes, I'll put something up on YouTube.

  3. Thanks for providing such useful content. Multi-Tenant Cloud Storage is an ideal solution for Unstructured Data Storage management. Please share more useful thoughts with us.

  4. Yes, John. With data storage in the cloud it becomes more versatile, and multi-tenancy would add a different dimension to it. That will let one servitize this solution!