Thursday, 20 September 2012

Reporting framework with Apache POI, MS Excel and BIRT


Recently I came across a unique requirement, where in an inventory entitlement analysis model was developed in MS Excel and the customer wanted to move away from MS Excel model and turn it into a more scalable, extendable model, which could then further be deployed as SaaS model. Though the requirement was to move away from MS Excel, considering the business domain peculiarities, practicalities and user behavior, using MS Excel in some way was a strong possible solution option. Also the requirements demanded usage of statistical computation for forecasting and estimations where in the calculation power of MS Excel would have come real handy, so that was finally the option of choice. 

Problem Statement

The pressing need of moving away from MS Excel based model for the customer was result of few things. Firstly, the MS Excel based model was “evolved” to what it was and so a lot of static references were introduced all over the sheets and it had reached a state where making a slight change like adding a new product or a warehouse would mean a lot of work and making it very rigid and practically impossible to scale. Another reason to move away from MS Excel based model was the need to have a relational database and a way to maintain full history of actual and forecast data, to be able to do better prediction and enhanced reporting. 

Solution

Basically, I had three options to choose from – 
1. Develop the entire app in excel by re-modeling the data storage and using macros 
2. Use excel as front end and plug-in a database engine like MS Access 
3. Use excel only as calculation engine and develop a web based application that ‘uses’ the excel model using third party libraries for excel api. 

After studying the customer problem statements, requirements and understanding their business, and knowing the typical user behavior, I decided to still make use of MS Excel, in a more optimal fashion, which led me to choose the third option. 

Overall solution then went on evolving as we dug deeper into each component of the high level architecture. JSF was chosen to cater for the pluggable user interface need on account of its recent popularity and active community (struts community didn’t show any activity for last few months). For relational database we chose MySql’s community version – that was an easy choice. Another very crucial requirement was to be able to generate crosstab reports, line and pie charts. Keeping the reporting dynamism in mind, I decided to use a reporting framework – again, an open source one. There were couple of options to choose from where in Jasper and BIRT topped the list. Some of the pros we saw in BIRT compared to Jasper were in terms of ease of development, implementation of crosstabs and drill-down and linked reports. After few POCs, we choose BIRT over Jasper. BIRT also had a third party library (JSF4BIRT) which made it very easy to incorporate BIRT report into JSF. 

The application being a forecasting application, involved some complex evaluation models and formulae which, I thought, would be very easy to implement in MS Excel and fairly complex, otherwise. So I decided to deploy MS Excel to this task. Apache’s POI was then deployed to facilitate integration of the java application with MS Excel documents. 

For other calculations, which were huge in number but not so complex, PLSQL seemed to be right place since if they were to be done in the middle layer, would involve huge number of database calls hampering the performance, so in order to do these calculations closer to where data resides, PLSQL Stored Procedures were deployed. 

There we had, as envisioned, a web based J2EE application where in, the business user could easily maintain the master data and during every forecast run (typically weekly or monthly), download an automatically generated, dynamic MS Excel based input template. He/she would then fill it in at leisure and upload it back into the system. Basic data validations being handled using the MS Excel validation framework itself, System would then do further business validations and dump the data into the staging area. User would then “run” the model to trigger the system to generate forecasts and variety of reports in batch and using BIRT reporting engine, just render the filter based parameterized crosstabs and charts, on the report pages.

Future

This framework could further be enhanced to incorporate various goodies like BI kind of features letting users create their own reports, adapters with ERPs to automate process of data population and report generation, tagging exceptional data points and preserving tags for future reference and so on.

Thursday, 19 April 2012

Unstructured Data Mining – A huge challenge or a huge opportunity ?

Gartner estimates that unstructured data constitutes 80% of the whole enterprise data. A huge proportion of this unstructured data comprises chat transcripts, emails and other informal and semi-formal internal and external communications like MS Word and PDF documents. Usually such text is meant for human consumption. However, now with huge amounts of such text being present, both online and within the enterprise, it is important to mine such text using computers.

By Definition, Unstructured Data (or unstructured information) refers to information that either does not have a pre-defined data model and/or does not fit well into relational tables. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well.

An insidious dilemma right now is regarding the issue of unstructured information. In businesses and offices, file storage and database is strewn with sensitive data that’s uncategorized, unclassified, and unprotected. With all the data exchanged within and between offices, collecting and organizing these data is proving to be a challenge.

Managing unstructured information is vital for any business, as these uncategorized data may prove to be vital in the decision-making process. Much investment is going into searching and systematizing data in networks. This is because a host of vital information may be found in these free-form texts, both in soft and hard form ) such as the following:

•  Client Responses - This information may just be buried within countless emails and correspondence.
•  Market Rival - A slew of new products and services manufactured by the competition may be analyzed by uncategorized research documents.
•  Market Segments - Feedback from consumers and customers may be derived from call transcripts and user comments.

For a company, the successful classification and management of unstructured information may lead to more profitable decisions and business opportunities.

Dealing with unstructured data
Data mining and text analytics and noisy text analytics techniques are different methods used to find patterns in, or otherwise “interpret”, this information. Common techniques for structuring text usually involve manual tagging with metadata or Part-of-speech tagging for further text mining-based structuring. There are several commercial solutions which help one to analyze and understand unstructured data for business applications. Apache UIMA, an Apache product, on the other hand, is an open source option.

UIMA (Unstructured Information Management Architecture) provides a common framework for processing this information to extract meaning and create structured data about the information.

UIMA analyzes large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located at.

UIMA enables applications to be decomposed into components, for example "language identification" => "language specific segmentation" => "sentence boundary detection" => "entity detection (person/place names etc.)". To do so, it provides components which implement interfaces defined by the framework and provides self describing metadata via XML descriptor files. It additionally provides capabilities to wrap components as network services, and can scale to very large volumes by replicating processing pipelines over a cluster of networked nodes.

Additional toolset is also provided to further extend UIMA’a capabilities of extracting meaningful information from the unstructured data. One of such tool is called CFE ( Configurable Feature Extractor ) that enables feature extraction in a very generalized way. This is done using rules expressed in FESL (Feature Extraction Specification Language) in XML form. FESL's rule semantics allow the precise identification of the information that is required to be extracted by specifying precise multi-parameter criteria.

In my opinion, Unstructured Data Management will provide huge help to government and semi-government agencies, more than IT industries, where millions of unstructured/semi-structured documents are lying on shelves in physical form or in computers in soft form, waiting to be explored, read again and referred to.

There are some open source, powerful search platforms like Apache Solr, which claim to integrate with UIMA seamlessly which further strengthens the case for use of these open source technologies put together to solve the problem of this enterprise world. Having said that, is that really a huge problem or a challenge to face or really an opportunity that an IT service company should grab?

References
http://en.wikipedia.org/wiki/
http://www.unstructuredinformation.com/
http://uima.apache.org
http://lucene.apache.org/solr/
http://domino.research.ibm.com/library/Cyberdig.nsf/papers/BA59E9190C9534B4852574F000482E86/$File/rc24673.pdf

Monday, 2 April 2012

Exploring ANTLR

Getting ANTLR to work, atleast as a proof of concept, wasn't trivial as I was very new to the whole concept of defining language constructs, writing compilers and so on. But once you get accustomed to the terms and this technology, you feel really empowered. You can theoretically write your own language or convert anything, into anything else :)

I am going to cover basics about ANTLR here and the output of the POC.

The Big Picture



* The lexical analyzer, or lexer, breaks up the input stream into tokens.
* Tokens are vocabulary symbols emanating from the lexical analyzer.
* The parser feeds off this token stream and tries to recognize the sentence structure.
* Abstract syntax tree (AST) is a TREE representation of the language syntax.
* Output can be AST or text ( using StringTemplates ).


Definitions, Tools and usage

* Grammar describes the syntax of a language.
* ANTLTWorks – IDE to write, verify, test, debug the grammar
* Commandline - java org.antlr.Tool grammar.g
* Generates Tokens, Lexer, Parser java classes

Imagine a maze with a single entrance and single exit that has words written on the floor. Every path from entrance to exit generates a sentence by “saying” the words in sequence. In a sense, the maze is analogous to a grammar that defines a language.


Grammar Example

prog : stat+ ;
stat : expr NEWLINE
| ID '=' expr NEWLINE
| NEWLINE;
expr : multExpr (('+' |'-' ) multExpr)* ;
multExpr : atom ('*' atom)* ;
atom : INT
| ID
| '(' expr ')’ ;
ID : ('a'..'z' |'A'..'Z' )+ ;
INT : '0'..'9' + ;
WS : (' ' |'\t' |'\n' |'\r' )+ {skip();} ;


Test Class


public class Test {
public static void main(String[] args) throws Exception {
// Create an input character stream from standard in
ANTLRInputStream input = new ANTLRInputStream(System.in);

// Create an ExprLexer that feeds from that stream
ExprLexer lexer = new ExprLexer(input);

// Create a stream of tokens fed by the lexer
CommonTokenStream tokens = new CommonTokenStream(lexer);

// Create a parser that feeds off the token stream
ExprParser parser = new ExprParser(tokens);

// Begin parsing at rule prog
parser.prog();
}
}


AST vs Parse Tree for 3+4*5

An AST is to be distinguished from a parse tree, which represents the sequence of rule invocations used to match an input stream.

Grammar File Structure

Filename : grammarName.g

/** This is a document comment */
grammarType grammarName;
«optionsSpec»
«tokensSpec»
«attributeScopes»
«actions»
/** doc comment */
rule1 : ... | ... | ... ;
rule2 : ... | ... | ... ;
...

The order of the grammar sections must be as shown, with the rules appearing after all the other sections.


Sample Grammar

grammar T;
options {
language=Java;
}
@members {
String s;
public String changeCase(String in)
{
return in.toUpperCase();
}
}
r : ID '#' {s = $ID.text; System.out.println("found "+s);} -> declTemplate(id={$ID.text}) ;
ID : 'a'..'z' + ;
WS : (' ' |'\n' |'\r' )+ {skip();} ; // ignore whitespace


Template file : mytemplate.stg


declTemplate (id) ::= <<
var ;
>>


Actions embedded within Rules


grammar T;
options {language=Java;}
@header {
package org.antlr.test;
}
r[int a, String b] returns [int n]
@init {
$n=$a; // init return value
}
@after {
System.out.println("returning value n="+$n);
}
: ... {$n=23;}
| ... {$n=9;}
| ... {$n=1;}
| // use initialized value of n
;


Rules Error Handling


void rule() throws RecognitionException {
try {
«rule-body»
}
catch (RecognitionException re) {
reportError(re);
recover(input,re);
}
}


Overriding default error handling

To get these rich error messages, override two methods from BaseRecognizer,
displayRecognitionError( )
getTokenErrorDisplay( )

Example of a rich message

$ java Test
<< 1+;
<< EOF
>> line 1:2 [prog, stat, expr, multExpr, atom] no viable alternative,
token=[@2,2:2=';',<7>,1:2] (decision=5 state 0)
decision=<<35:1: atom : ( INT | '(' expr ')' );>>
found expr: 1+;

Grammar Level Options

options {
language = Java / C#;
output = template / AST;
backtrack = true / false;
memorize = true / false;
rewrite = true / false;
}

Rule Attributes

Attribute Type
--------- ------------------------------------------------------------
text String // text matched by the rule
start Token // first token matched by the rule
stop Token // last token matched by the rule
tree Object // AST computed by the rule
st StringTemplate // Template computed for this rule


Example

r : ID '#' {s = $ID.text; System.out.println("found "+s);} -> declTemplate(id={$ID.st}) ;


POC : PLSQL to JAVA Converter



References

Websites
http://www.antlr.org/
http://www.stringtemplate.org/
http://www.antlr.org/grammar/list
http://www.antlr.org/grammar/1279318813752/PLSQL.g

Books
The.Definitive.ANTLR.Reference.pdf