Preparing your computer for machine learning

Machine learning offers a powerful new set of tools for environmental science. There is a broad range of data-driven approaches that can be applied to the spectrum of challenges in the field. An initial step is to prepare a computer to undertake these tasks and, in so doing, to enable the widest possible range of machine learning capabilities. In this blog post we will consider how to prepare your computer for machine learning.

Choosing a language

To develop custom and flexible approaches in environmental data science, you will likely need to write code, and for this you need to select a language. A good place to start is with the powerful scripting languages Python and R. These two languages are gaining pre-eminence amongst data scientists for handling and investigating the data challenges of machine learning. That is not to say these are the only languages available; there are many more, and Scala, Julia and Perl, to name a few, have their place. However, Python and R are a good place to start!

A particularly powerful feature of these languages is their ‘extensibility’. Straight out of the box they both have a wide range of tried and tested capabilities, but where additional features are required there is a huge range of libraries of extended functionality that can be linked into your projects. These include machine learning functions (e.g. scikit-learn, TensorFlow, and Theano) and advanced data handling capabilities (e.g. Dask, NumPy, pandas, and Numba), as well as a range of presentational and visualisation tools. There are many such libraries, and selecting from them can pose a challenge.
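To make the idea of extensibility concrete, here is a minimal sketch in Python. The river-flow readings are made up purely for illustration, and the function names are hypothetical. The first function uses only what Python provides out of the box; the second does the same calculation with NumPy if it happens to be installed, and returns None otherwise.

```python
import statistics

# Hypothetical daily river-flow readings (cubic metres per second),
# invented for illustration only.
flows = [12.1, 13.4, 11.8, 15.2, 14.9, 13.0]


def builtin_summary(values):
    """Mean and sample standard deviation using only the standard library."""
    return statistics.mean(values), statistics.stdev(values)


def numpy_summary(values):
    """The same summary using NumPy, or None if NumPy is not installed."""
    try:
        import numpy as np
    except ImportError:
        return None
    arr = np.asarray(values, dtype=float)
    # ddof=1 gives the sample standard deviation, matching statistics.stdev.
    return float(arr.mean()), float(arr.std(ddof=1))


print("Standard library:", builtin_summary(flows))
print("NumPy:", numpy_summary(flows) or "not installed")
```

For six numbers the two routes are interchangeable. The case for a library such as NumPy grows with the size of the data, where working on whole arrays at once becomes much more convenient.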

The good news is that a full set of software tools designed for data science projects, covering these programming languages and libraries, can be installed together in one go: this is Anaconda. If you install Anaconda, you will have access to 1,500+ packages for data science, including both Python and R environments and much more. Better still, it is free to do (although paid-for offers exist that provide enterprise support). A package manager, ‘Conda’, will help you keep the software tools up to date.

Installing Anaconda

The preferred means of installing Anaconda varies depending on your computer type. Installers for Anaconda are available online, along with installation documentation. For Windows, this is the best way to proceed. Installers are also available for macOS, but there it is also possible, and may be preferable, to use a system package manager such as Homebrew (‘brew’). Homebrew is installed first, and the Anaconda ‘formula’ can then be installed with it.
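Once installation has finished, a quick check from Python can confirm which of the libraries mentioned earlier are actually present. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and the mapping from import names to package names is an assumption about how these libraries are published.

```python
from importlib import metadata
from importlib.util import find_spec

# Import name -> package (distribution) name for the libraries named in this
# post. This mapping is an assumption, not something Anaconda guarantees.
LIBRARIES = {
    "sklearn": "scikit-learn",
    "tensorflow": "tensorflow",
    "theano": "Theano",
    "dask": "dask",
    "numpy": "numpy",
    "pandas": "pandas",
    "numba": "numba",
}


def installed_version(import_name, dist_name):
    """Return the installed version string, or None if the library is absent."""
    if find_spec(import_name) is None:
        return None
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # Importable, but no package metadata was found under this name.
        return "unknown"


def report(libraries=LIBRARIES):
    """Map each import name to its installed version (None if missing)."""
    return {name: installed_version(name, dist) for name, dist in libraries.items()}


if __name__ == "__main__":
    for name, version in report().items():
        print(f"{name:12s} {version or 'not installed'}")
```

Any library reported as not installed can usually be added afterwards through the package manager, so a missing entry is not a sign that the installation failed.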

Weka – an alternative

The Anaconda suite of tools can be overwhelming: with 1,500+ packages, you are faced with a huge choice of tools.

Perhaps an easier introduction to machine learning is the tool Weka. Weka was developed at the University of Waikato in New Zealand, and offers a collection of machine learning algorithms for data mining tasks. It includes tools for data preparation, classification, regression, clustering, association rule mining, and visualisation. A series of videos, an excellent manual and a MOOC can help the learning process. Many people use Weka to explore data and apply machine learning algorithms, and then, if needed, hand-craft code in Python or R to explore the data further.