Friday, February 03, 2012

Sub-transactional big-data and data analysis

Pentaho is known for its BI solutions; however, it is also (less well) known for its big-data and big-analysis expertise in combination with solutions such as Hadoop. Big-data is not a strictly defined term; there is no fixed threshold at which a data set becomes big-data. In general, big-data refers to a very fast-growing set of data to which large amounts of new data are added in real time. Good examples are Twitter, which stores tweets at a very rapid pace, credit card companies, which store every transaction, and stock trading companies, which store all stock transactions along with information about the market.

In this video presentation James Dixon, CTO at Pentaho, states that for a large set of companies big-data is actually sub-transactional. Sub-transactional events are the events that happen between or before a business transaction (i.e. buying or selling something), for example the information on how a person came to the page on my website where he or she clicked the order button. This is commonly not seen as big-data; in essence, however, it is big-data and a very interesting area to jump into. This means that we can see storing information about the travel patterns of people on your website as big-data, and that we can see click analysis as big-analysis.
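To make the idea of sub-transactional events a bit more concrete, here is a minimal sketch in Python of how the clicks leading up to an order could be pulled out of a raw click log. The event layout (session id, timestamp, page), the page names and the function name `paths_to_order` are all my own assumptions for illustration; they do not come from the video or from any Pentaho product.

```python
from collections import defaultdict

# Hypothetical clickstream: (session_id, timestamp, page) tuples.
events = [
    ("s1", 1, "/home"),
    ("s1", 2, "/products"),
    ("s1", 3, "/products/widget"),
    ("s1", 4, "/order"),
    ("s2", 1, "/home"),
    ("s2", 2, "/about"),
]


def paths_to_order(events, order_page="/order"):
    """Return, per session, the pages visited up to and including the order page.

    Sessions that never reach the order page are left out.
    """
    by_session = defaultdict(list)
    # Sort by session and timestamp so each path is in click order.
    for session_id, _timestamp, page in sorted(events, key=lambda e: (e[0], e[1])):
        by_session[session_id].append(page)

    result = {}
    for session_id, pages in by_session.items():
        if order_page in pages:
            # Keep only the sub-transactional trail that ends in the transaction.
            result[session_id] = pages[: pages.index(order_page) + 1]
    return result


order_paths = paths_to_order(events)
```

In this toy data only session `s1` ends in an order, so only its trail is kept. On a real site the same grouping would run over a much larger volume of events, which is exactly where the big-data tooling comes in.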

(Big-)data analysis of website visitors and how they click their way to your product is already done today by a number of software vendors. The issue, however, is that this is done after the events have happened. Information is stored, commonly analyzed overnight, and only after that used to improve the website. If you have a computing cluster which can do your big-analysis on your big-data fast enough, you could have your website content adapt to the click patterns in a smarter and faster way than current solutions offer.
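As a rough illustration of the difference between overnight analysis and analysis as the clicks come in, the sketch below updates page-to-page transition counts on every click, so the most common next page can be looked up right away. The class `ClickPatternTracker` and its methods are hypothetical names of my own; this is a single-process toy, not how Pentaho or a Hadoop cluster would implement it.

```python
from collections import Counter, defaultdict


class ClickPatternTracker:
    """Keeps running counts of page-to-page transitions (illustrative only)."""

    def __init__(self):
        # transitions[from_page][to_page] = number of times seen so far
        self.transitions = defaultdict(Counter)
        # last page seen per session, needed to form the next transition
        self.last_page = {}

    def record(self, session_id, page):
        """Update the counts as soon as a click arrives."""
        previous = self.last_page.get(session_id)
        if previous is not None:
            self.transitions[previous][page] += 1
        self.last_page[session_id] = page

    def most_likely_next(self, page):
        """Return the most frequently seen next page, or None if unknown."""
        counts = self.transitions.get(page)
        if not counts:
            return None
        return counts.most_common(1)[0][0]


tracker = ClickPatternTracker()
# Hypothetical clicks arriving one by one from three visitors.
for session_id, page in [
    ("s1", "/home"), ("s2", "/home"), ("s1", "/products"),
    ("s2", "/products"), ("s3", "/home"), ("s3", "/about"),
]:
    tracker.record(session_id, page)

suggestion = tracker.most_likely_next("/home")
```

A website could use a lookup like `most_likely_next` to decide which content to highlight on a page while the visitor is still browsing, instead of waiting for the next overnight run.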

The video also gives a first glimpse of how Pentaho thinks the big-data architecture landscape looks and how you should think about data lakes, data marts, data warehouses and ad-hoc queries. It also explains why you should never delete data: even if you are not using it at this moment, it might be needed at a later moment and could make sense to you then.
