On Solr and Mahout. Interview with Grant Ingersoll
“When does it get practical for most people, not just the Google’s and the Facebook’s of the world? I’ve seen some cool usages of big data over the years, but I also see a lot of people with a solution looking for a problem.”–Grant Ingersoll.
I have interviewed Grant Ingersoll, CTO and co-founder of LucidWorks. Grant is an active member of the Lucene community, and co-founder of the Apache Mahout machine learning project.
I wish you a Happy and a Peaceful 2015!
Q1. Why LucidWorks Search? What kind of value-add capabilities does it provide with respect to the Apache Lucene/Solr open source search?
Grant Ingersoll: I like to think of LucidWorks Search (LWS) as Solr++, that is, we give you all of the goodness of Solr and then some more. Our primary focus in building LWS is in 4 key areas:
1. IT integration — Make it easy to consume Solr within an IT organization via things like monitoring, APIs, installation and so on.
2. Enterprise readiness — Large enterprises have 1 of everything and they all have a multitude of security requirements, so we focus on making it easier to operate in these environments via things like connectors for data acquisition, security and the like
3. Tools for Subject Matter Experts — These are aimed at technical non developers like Business Analysts, Merchandisers, etc. who are responsible for understanding who asked for what, when and why. These tools are primarily aimed at understanding relevancy of search results and then taking action based on business needs.
4. Deliver a supported version of the open source so that companies can reliably deploy it knowing they have us to back them up.
Q2. At LucidWorkd you have integrated Apache open source projects to deliver a Big Data application development and deployment platform. What does the emerging big data stack look like?
Grant Ingersoll: We use capabilities from the Hadoop ecosystem for a number of activities that we routinely see customers struggling with when they try to better understand their data. In many cases, this boils down to large scale log analysis to power things like recommendation systems or Mahout for machine learning, but it also can be more subtle like doing large scale content extraction from Office documents or natural language processing approaches for identifying interesting phrases. We also rely on Zookeeper quite heavily to make sure that our cluster stays in a happy state and doesn’t suffer from split brain issues and cause failures.
Q3. How does it different with respect to other Big Data Hadoop-based distributions such as Cloudera, Hortonworks, and Greenplum Pivotal HD?
Grant Ingersoll: I can’t speak to their integrations in great detail, but we integrate with all of them (as well as partner with most of them), so I guess you would say we try to work at a layer above the core Hadoop infrastructure and focus on how the Hadoop ecosystem can solve specific problems as opposed to being a general purpose tool. For instance, we ship with a number of out of the box workflows designed to solve common problems in search like click-through log analysis and whole collection document clustering so you don’t have to write them yourself.
Q4. How does it work to build a framework for big data with open source technologies that are “pre-integrated”?
Grant Ingersoll: Well, you quickly realize what a version soup there is out there, trying to support all the different “flavors” of Hadoop. Other than, it is a lot of fun to leverage the technologies to solve real problems that help people better understand their data. Naturally, there are challenges in making sure all the processes work together at scale, so a lot of effort goes into those areas.
Q5. What happens when big data plus search meets the cloud?
Grant Ingersoll: You get cost effective access and insight into your data instead of a big science experiment. In many ways, the benefits are the same as search and ranking in on-prem situations plus the added benefits the cloud brings you in terms of costs, scaling and flexibility. Of course, the well-documented challenge in the cloud is how to get your data there. So, for users who already have their data in the cloud, it’s an especially easy win, for those who don’t, we provide connectors that help.
Q6. Solr Query includes simple join capability between two document types. How do such queries scale with Big Data?
Grant Ingersoll: Solr scales quite well (billions of documents and very large query volumes).
In fact, we’ve seen it routinely scale linearly to quite large cluster sizes.
As with databases, joins require you to pay attention to how you do the join or whether there are better ways of asking your question, but I have seen them used quite successfully in the appropriate situation. At the end of the day, I try to remain pragmatic and use the appropriate tool for the job. A search engine can handle some types of joins, but that doesn’t always mean you should do it in a search engine. I like to think of a search engine as a very fast ranking engine. If the problem requires me to rank something, than search engine technology is going to be hard to beat. If you need it to do all different kinds of joins across a large number of document types or constant large table scans, it may be appropriate to do in a search engine and it may not. It’s a classic “it depends” situation. That being said, over the past few years, these kinds of problems have become much more efficient to do in a search engine thanks to a multitude of improvements the community has made to Lucene and Solr.
Q7. The Apache Mahout Machine Learning Project’s goal is to build scalable machine learning libraries. What is current status of the project?
Grant Ingersoll: We released 0.9 and are working towards a 1.0. The main focus lately has been on preparing for a 1.0 release by culling old, unused code and tightly focusing on a core set of algorithms which are tried and true that we want to support going forward.
Q8. What kind of algorithms is Apache Mahout currently supporting?
Grant Ingersoll: I tend to think of Mahout as being focused on the three “C’s”: clustering, classification and collaborative filtering (recommenders). These algorithms help people better understand and organize their data. Mahout also has various other algorithms like singular value decomposition, collocations and a bunch of libraries for Java primitives.
Q9. How does Mahout relies on the Apache Hadoop framework?
Grant Ingersoll: Many of the algorithms are written for Hadoop specifically, but not all. We try to be prudent about where it makes sense to use Hadoop and where it doesn’t, as not all machine learning algorithms are best suited for Map-Reduce style programming. We are also looking at how to leverage other frameworks like Spark or custom distributed code.
Q10. Who is using Apache Mahout and for what?
Grant Ingersoll: It really spans a lot of interesting companies, ranging from those using it to power recommendations to others classifying users to show them ads. At LucidWorks, we use Mahout for identifying statistically interesting phrases, clustering and classification of user’s query intent and more.
Q11. How scalable is Apache Mahout? What are the limits?
Grant Ingersoll: That will depend on the algorithm. I haven’t personally run an exhaustive benchmark, but I’ve seen many of the clustering and classification algorithms scale linearly.
Q12. How do you take into account user feedback when performing Recommendation mining with Apache Mahout?
Grant Ingersoll: Mahout’s recommenders are primarily of the “collaborative filtering” type, where user feedback equates to a vote for a particular item. All of those votes are, to simplify things a bit, added up to produce a recommendation for the user. Mahout supports a number of different ways of calculating those recommendations, since it is a library for producing recommendations and not just a one size fits all product.
Q13. Looking at three elements: Data, Platform, Analysis, what are the main challenges ahead?
Grant Ingersoll: I’d add a fourth element: the user. Lots of interesting challenges here:
When do we get past the hype cycle of big data and into the nitty gritty of making it real? That is, when does it get practical for most people, not just the Google’s and the Facebook’s of the world? I’ve seen some cool usages of big data over the years, but I also see a lot of people with a solution looking for a problem.
How do we leverage the data, the platform and the analysis to make us smarter/better off instead of just better marketing targets? How do we use these tools to personalize without offending or destroying privacy?
How do we continue to meet scale requirements without breaking the bank on hardware purchases, etc?
Qx. Anything you wish to add?
Grant Ingersoll: Thanks for the great questions!
Grant Ingersoll, CTO and co-founder of LucidWorks, is an active member of the Lucene community – a Lucene and Solr committer, co-founder of the Apache Mahout machine learning project and a long-standing member of the Apache Software Foundation. He is co-author of “Taming Text” from Manning Publications, and his experience includes work at the Center for Natural Language Processing at Syracuse University in natural language processing and information retrieval.
Ingersoll has a Bachelor of Science degree in Math and Computer Science from Amherst College and a Master of Science degree in Computer Science from Syracuse University.
– Taming Text How to Find, Organize, and Manipulate It
Grant S. Ingersoll, Thomas S. Morton, and Andrew L. Farris
Softbound print: September 2012 (est.) | 350 pages, Manning, ISBN: 193398838X
Follow ODBMS.org on Twitter: @odbmsorg