Data Science Tools: Exploratory Data Analysis
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
Weka is open source software issued under the GNU General Public License.
We have put together several free online courses that teach machine learning and data mining using Weka. Check out the courses' website for details on when and how to enrol. The videos for the courses are available on YouTube.
Yes, it is possible to apply Weka to big data!
Use Fusion Tables with other APIs too
- Create a Google Map layer with your Fusion Table data (Maps API).
- Produce charts and graphs (Chart Tools API).
- Set specific types, roles, and values that allow secure collaboration (Google Drive Web APIs).
- Collect data through an input form (Google Apps Script).
The Fusion Tables community: Join the Fusion Tables API announce group to get updates on changes and improvements to the API.
Gephi: The Open Graph Viz Platform
Gephi is the leading visualization and exploration software for all kinds of graphs and networks. Gephi is open-source and free.
Runs on Windows, Mac OS X and Linux.
Gephi is a tool for data analysts and scientists keen to explore and understand graphs. Like Photoshop™ but for graph data, the user interacts with the representation and manipulates the structures, shapes and colors to reveal hidden patterns. The goal is to help data analysts form hypotheses, intuitively discover patterns, and isolate structural singularities or faults during data sourcing. It is a complementary tool to traditional statistics, as visual thinking with interactive interfaces is now recognized to facilitate reasoning. Gephi is software for Exploratory Data Analysis, a paradigm that emerged in the Visual Analytics field of research.
OpenRefine: OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it, transforming it from one format into another, and extending it with web services and external data.
Please note that since October 2nd, 2012, Google has not actively supported this project, which has been rebranded as OpenRefine. Project development, documentation and promotion are now fully supported by volunteers. Find out more about the history of OpenRefine and how you can help the community.
Using OpenRefine – The Book
Using OpenRefine, by Ruben Verborgh and Max De Wilde, offers a great introduction to OpenRefine. Organized as recipes with hands-on examples, the book covers the following topics:
- Import data in various formats
- Explore datasets in a matter of seconds
- Apply basic and advanced cell transformations
- Deal with cells that contain multiple values
- Create instantaneous links between datasets
- Filter and partition your data easily with regular expressions
- Use named-entity extraction on full-text fields to automatically identify topics
- Perform advanced data operations with the General Refine Expression Language
Orange: Open source machine learning and data visualization for novice and expert. Interactive data analysis workflows with a large toolbox.
Perform simple data analysis with clever data visualization. Explore statistical distributions, box plots and scatter plots, or dive deeper with decision trees, hierarchical clustering, heatmaps, MDS and linear projections. Even your multidimensional data can become sensible in 2D, especially with clever attribute ranking and selections.
Interactive data exploration for rapid qualitative analysis with clean visualizations. A graphical user interface allows you to focus on exploratory data analysis instead of coding, while clever defaults make fast prototyping of a data analysis workflow extremely easy. Place widgets on the canvas, connect them, load your datasets and harvest the insight!
When teaching data mining, we like to illustrate rather than only explain. And Orange is great at that. Used at schools, universities and in professional training courses across the world, Orange supports hands-on training and visual illustrations of concepts from data science. There are even widgets that were especially designed for teaching.
Add-ons Extend Functionality
Use various add-ons available within Orange to mine data from external data sources, perform natural language processing and text mining, conduct network analysis, infer frequent itemsets and perform association rule mining. Additionally, bioinformaticians and molecular biologists can use Orange to rank genes by their differential expression and perform enrichment analysis.
Project Jupyter is an open source project that was born out of the IPython Project in 2014 as it evolved to support interactive data science and scientific computing across all programming languages. Jupyter will always be 100% open source software, free for all to use and released under the liberal terms of the modified BSD license.
Pandas, Python Data Analysis Library
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
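As a rough illustration of the kind of exploratory step pandas supports, the sketch below builds a small DataFrame and inspects it. The data and the column names (species, length, width) are invented for the example and do not come from any real dataset; only standard pandas calls (isna, describe, groupby) are used.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame(
    {
        "species": ["a", "a", "b", "b", "b"],
        "length": [1.0, 2.0, 3.0, 4.0, 5.0],
        "width": [0.5, 0.7, 1.1, 1.3, None],
    }
)

# Count missing values in each column.
missing = df.isna().sum()

# Summary statistics (count, mean, std, quartiles) for the numeric columns.
summary = df.describe()

# Mean of each numeric column within each group; missing values are skipped.
group_means = df.groupby("species")[["length", "width"]].mean()

print(missing)
print(summary)
print(group_means)
```

Checking for missing values before computing summaries is a common first move in exploratory analysis, since most pandas aggregations silently skip them.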
IPython: IPython provides a rich architecture for interactive computing.
Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shell, the Jupyter notebook, web application servers, and four graphical user interface toolkits.
For simple plotting the pyplot module provides a MATLAB-like interface, particularly when combined with IPython. For the power user, you have full control of line styles, font properties, axes properties, etc., via an object oriented interface or via a set of functions familiar to MATLAB users.
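The two interfaces mentioned above can be compared in a minimal sketch. The data is made up, and the Agg backend and in-memory buffers are assumptions chosen so the script runs without a display or file system access; in interactive use you would typically call plt.show() or pass a file path to savefig instead.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt

# Made-up data for illustration.
x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

# pyplot (MATLAB-like) interface: each call acts on the current figure.
plt.figure()
plt.plot(x, y, "r--")  # red dashed line
plt.xlabel("x")
plt.ylabel("x squared")
pyplot_png = io.BytesIO()
plt.savefig(pyplot_png, format="png")  # a file path works here as well
plt.close()

# Object-oriented interface: explicit Figure and Axes objects.
fig, ax = plt.subplots()
(line,) = ax.plot(x, y, linestyle="--", color="red", linewidth=2)
ax.set_xlabel("x")
ax.set_ylabel("x squared")
ax.set_title("Same plot via the object-oriented interface")
oo_png = io.BytesIO()
fig.savefig(oo_png, format="png")
plt.close(fig)
```

The pyplot form is convenient for quick one-off plots, while holding explicit figure and axes objects tends to scale better once a figure has several subplots or needs fine-grained styling.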
R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.
One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.
R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and macOS.