
5 of the Best Free and Open Source Data Mining Software

The process of extracting patterns from data is called data mining. Modern businesses regard it as an essential tool because it converts data into business intelligence, giving them an informational edge. At present, it is widely used in profiling practices such as surveillance, marketing, scientific discovery, and fraud detection.


Data mining normally involves four kinds of tasks:

* Classification - generalizes known structure so that it can be applied to new data.
* Clustering - finds groups and structures in the data that are in some way similar, without using known structures in the data.
* Association rule learning - looks for relationships between variables.
* Regression - aims to find a function that models the data with the least error.
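To make the four tasks concrete, here is a minimal sketch in plain Python. It is only an illustration: the function names and the choice of techniques (1-nearest-neighbour, k-means, rule confidence, least squares) are assumptions made for this example, not how any of the tools below work internally.

```python
from statistics import mean

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def classify(point, training):
    """Classification: 1-nearest-neighbour, reusing known labels for new data."""
    return min(training, key=lambda item: sq_dist(point, item[0]))[1]

def cluster(points, centers, rounds=10):
    """Clustering: plain k-means, grouping points without using any labels."""
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [tuple(mean(c) for c in zip(*g)) if g else c
                   for g, c in zip(groups, centers)]
    return groups

def confidence(baskets, lhs, rhs):
    """Association rule learning: confidence of the rule lhs -> rhs.

    Assumes at least one basket contains lhs.
    """
    with_lhs = [b for b in baskets if lhs <= b]
    return sum(1 for b in with_lhs if rhs <= b) / len(with_lhs)

def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Toy data, invented for illustration only.
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.1), (4.8, 5.3)]
labeled = list(zip(points, ["a", "a", "b", "b"]))
baskets = [{"bread", "butter"}, {"bread", "butter", "milk"}, {"bread"}, {"milk"}]

label = classify((4.9, 5.0), labeled)
groups = cluster(points, centers=[(0.0, 0.0), (6.0, 6.0)])
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

Classification needs the labels, while clustering recovers the same two groups from the points alone. That is the practical difference between the first two tasks.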

For those of you looking for data mining tools, here are five of the best open-source data mining packages that you can get for free:


Orange
Orange is a component-based data mining and machine learning software suite that features a friendly yet powerful, fast, and versatile visual programming front end for exploratory data analysis and visualization, along with Python bindings and libraries for scripting. It contains a complete set of components for data preprocessing, feature scoring and filtering, modeling, model evaluation, and exploration techniques. It is written in C++ and Python, and its graphical user interface is based on the cross-platform Qt framework.


RapidMiner
RapidMiner, formerly called YALE (Yet Another Learning Environment), is an environment for machine learning and data mining experiments that is used for both research and real-world data mining tasks. It allows experiments to be built from a large number of arbitrarily nestable operators, which are described in XML files and assembled with RapidMiner's graphical user interface. RapidMiner provides more than 500 operators for all main machine learning procedures, and it also integrates the learning schemes and attribute evaluators of the Weka learning environment. It is available as a stand-alone tool for data analysis and as a data mining engine that can be integrated into your own products.
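The idea of nestable operators described in XML can be sketched with Python's standard XML parser. Note that the element name `operator`, the `name` attribute, and the operator names below are all invented for this illustration; this is not RapidMiner's actual file format.

```python
import xml.etree.ElementTree as ET

# Hypothetical process description: operators may contain other operators.
PROCESS_XML = """
<operator name="Root">
  <operator name="LoadData"/>
  <operator name="CrossValidation">
    <operator name="TrainModel"/>
    <operator name="ApplyModel"/>
  </operator>
</operator>
"""

def outline(element, depth=0):
    """Return the operator tree as indented lines, one operator per line."""
    lines = ["  " * depth + element.get("name")]
    for child in element.findall("operator"):
        lines.extend(outline(child, depth + 1))
    return lines

tree = outline(ET.fromstring(PROCESS_XML))
print("\n".join(tree))
```

Because an operator can hold any number of child operators, the recursion handles arbitrarily deep nesting without special cases.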


Weka
Written in Java, Weka (Waikato Environment for Knowledge Analysis) is a well-known suite of machine learning software that supports several typical data mining tasks, particularly data preprocessing, clustering, classification, regression, visualization, and feature selection. Its techniques are based on the assumption that the data is available as a single flat file or relation, where each data point is described by a fixed number of attributes. Weka provides access to SQL databases using Java Database Connectivity (JDBC) and can process the results returned by a database query. Its main user interface is the Explorer, but the same functionality can be accessed from the command line or through the component-based Knowledge Flow interface.
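The "single flat relation" assumption is easy to picture with a database query. The sketch below uses Python's built-in `sqlite3` module rather than Weka or JDBC, and the table, column names, and values are made up for the example.

```python
import sqlite3

def load_relation(conn, query):
    """Run a query and return (attribute names, rows) as one flat relation."""
    cursor = conn.execute(query)
    attributes = [col[0] for col in cursor.description]
    rows = cursor.fetchall()
    # Every data point carries exactly one value per attribute.
    assert all(len(row) == len(attributes) for row in rows)
    return attributes, rows

# Toy in-memory database, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE iris (sepal_length REAL, petal_length REAL, species TEXT)")
conn.executemany(
    "INSERT INTO iris VALUES (?, ?, ?)",
    [(5.1, 1.4, "setosa"), (7.0, 4.7, "versicolor")],
)

attributes, rows = load_relation(
    conn,
    "SELECT sepal_length, petal_length, species FROM iris ORDER BY sepal_length",
)
```

Whatever joins or filters the query performs, the result is a table with a fixed set of columns, which is the shape of input the paragraph above describes.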


jHepWork
Designed for scientists, engineers, and students, jHepWork is a free and open-source data-analysis framework created as an attempt to build a data-analysis environment from open-source packages with a comprehensible user interface, and to offer a tool competitive with commercial programs. It is designed especially for interactive scientific plots in 2D and 3D and contains numerical scientific libraries implemented in Java for mathematical functions, random numbers, and other data mining algorithms. jHepWork is based on the high-level programming language Jython, but Java code can also call jHepWork's numerical and graphical libraries.


KNIME
KNIME (Konstanz Information Miner) is a user-friendly, intelligible, and comprehensive open-source platform for data integration, processing, analysis, and exploration. It gives users the ability to visually create data flows or pipelines, selectively execute some or all analysis steps, and later study the results, models, and interactive views. KNIME is written in Java and based on Eclipse, and it uses Eclipse's extension mechanism to support plugins that provide additional functionality. Through plugins, users can add modules for text, image, and time series processing, as well as integrate various other open-source projects, such as the R programming language, Weka, the Chemistry Development Kit, and LibSVM.
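Selective execution of pipeline steps can be illustrated with a few lines of Python. This is a hypothetical sketch: the `Node` class and its methods are invented here and have nothing to do with KNIME's real API.

```python
class Node:
    """One step in a data flow; it reads from at most one upstream node."""

    def __init__(self, name, func, source=None):
        self.name = name
        self.func = func
        self.source = source
        self.result = None
        self.executed = False

    def execute(self):
        """Run this node, first running any upstream node that has not run yet."""
        if not self.executed:
            upstream = self.source.execute() if self.source else None
            self.result = self.func(upstream)
            self.executed = True
        return self.result

# A three-step flow: read, clean, sort (toy data invented for illustration).
reader = Node("reader", lambda _: [3, 1, 2, None])
cleaner = Node("cleaner", lambda rows: [r for r in rows if r is not None], reader)
sorter = Node("sorter", sorted, cleaner)

cleaner.execute()                      # runs reader and cleaner only
sorter_ran_early = sorter.executed     # the last step has not run yet
final = sorter.execute()               # reuses the cached cleaner result
```

Each node keeps its result after running, so intermediate output can be inspected later and downstream steps do not repeat finished work.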


If you know of other free and open-source data mining software, please share it with us in the comments.

23 comments

  1. You should have included ELKI (Environment for DeveLoping KDD-Applications Supported by Index-Structures), which is a data mining software framework written in Java with a focus on clustering and outlier detection methods

  2. Thanks for including KNIME. It is indeed one of the best, if not the best data mining software available.

  3. Rattle: http://rattle.togaware.com/

  4. ROOT is a rather extensive framework for data analysis, written in C++ and originating from CERN. http://root.cern.ch/

    Not sure if it is classified as "data mining tool" though?

  5. Thanks for this short overview. Very interesting to see a few of these tools.

  6. You should have included R (www.-r-project.org), which recently has scored 2nd on a poll by KD Nuggets on the most used DM tools.

  7. @Luis What you are missing is simple. Data mining provides certain ways to extract data. However, R or something simpler could not solve the problem, let alone provide the solution.

  8. I'd like to know what your qualifications are for judging/comparing these and other data mining software packages.

  9. Are these for SQL databases only? Or do they also work on NoSQL?

  10. Mircea, if you have Python installed, you can get a single file NoSQL module wrapped around SQLite to quickly persist the results of mined data, it's called y_serial; see http://yserial.sourceforge.net -- very rapid solution.

  11. @rtd What concrete "way of extracting data" do you not have in R? Let me just name a few that you have, because contrary to what you are saying R is far from a "simple" tool and the list would be too extensive. In R you can read data from any DBMS (MySQL, SQL Server, ORACLE, etc. you name it); you can obtain data directly from the web (e.g. Yahoo finance); you can obviously import data from text and binary files in a huge number of formats; you can import data from other data analysis packages (statistical, spreadsheets, etc.); you have a huge (and growing) number of packages for interfacing/importing with a large number of software tools from different fields (bioinformatics, geographic information systems, etc.); and the list goes on. So what exactly are you talking about?! If you are talking about querying the data (indexing, subsetting, aggregation, or other summaries), then your observation is even more absurd because R can do all of that in a much more powerful and flexible way than the other mentioned tools. So please be objective in your comments, otherwise you are just leading other people in the wrong direction. If you want to mention a weakness of R compared to the other tools, then there is an obvious one: R does not come with a graphical user interface and that may be bad for beginners. Still, even in terms of GUIs there are a few available as extra packages for subsets of R functionality and even one (RATTLE) for data mining.

  12. good job, jun ! ... Thanks
    keep it up ...

  13. @Luis, well for beginners you're right. Think about it. Would you want to do everything in command line? hunn? It's been years since our industry has moved out of that while R is still stuck with that. The CRAN mirror almost never works and with pre-caching off I can't use it for anything but trivial tasks. That is why I said Simple. Simple as in used for "Simple tasks" not a "simple system". Talk about DBMS, in general it should support a generic interface like ODBC to fetch from "ALL" databases. Last year we were trying to import from DB2 and it wouldn't work. You must agree that RATTLE is still primitive.

  14. I believe it is important to note that 3 out of the 5 analytic tools mentioned (RapidMiner, Weka, and KNIME) + R/Rattle support PMML (Predictive Model Markup Language).

    PMML is the standard language to represent predictive analytic and data mining models. With a focus on interoperability, PMML allows for models to be easily moved around between applications for visualization or execution. For example, one can export a model built in KNIME in PMML and directly upload it in Weka.

    In addition to being embraced by Open Source tools, PMML is also supported by all the top commercial analytic tools, such as IBM/SPSS, SAS, KXEN, ...

    ADAPA, from Zementis, for example, is a universal PMML reader. It is able to deploy PMML models from all the different tools and make them instantly available for execution (as web services or through a web console). In this way, PMML provides a route to production IT deployment for predictive models built in commercial as well as open source tools.

  15. Yeah, you definitely missed ELKI, as Ganther said.

    PMML is machine learning, not so much data mining. It's useful for classifiers, but there is a whole world of data mining techniques not covered by it.

  16. It's good to see the overview of ELKI, which gives a newcomer some idea about it. So thanks.

  17. Which of the tools support graph data mining? I mean for graph data.

  18. Which one supports graph data mining? I mean graph data.

  19. Anonymous, May 21, 2012

    They all support it.
    By the way, jHepWork can be used on the Android platform!

  20. Anonymous, July 07, 2012

    Hi,
    I would like to know which tool could be useful for implementing fuzzy data mining on time series data.

  21. TANAGRA is free data mining software for academic and research purposes. It offers several data mining methods from the areas of exploratory data analysis, statistical learning, machine learning, and databases. TANAGRA is more powerful: it contains some supervised learning but also other paradigms such as clustering, factorial analysis, parametric and nonparametric statistics, association rules, and feature selection and construction algorithms. TANAGRA runs under most Windows systems; in any case, it has been tested under Windows 98, 2000, XP, Vista, and Windows 7.

  22. Include R. It is a great open source software. Also there is pspp..... but I don't know much about pspp.
