An Introductory Guide to Python & Machine Learning

Getting started with Python & Machine Learning
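(Reader's note: This is an introductory guide to machine learning. The author outlines the pros and cons of using Python to get started with machine learning, and which Python packages to start with.)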

Machine learning is eating the world right now. Everyone and their mother are learning about machine learning models, classification, neural networks, and Andrew Ng. You’ve decided you want to be a part of it, but where to start?

In this article we’ll cover some important characteristics of Python and why it’s great for machine learning. We’ll also cover some of the most important libraries it has for ML, and if it piques your interest, some places where you can learn more.

Why is Python used for machine learning?

Python is a great choice for machine learning for several reasons. First and foremost, it’s a simple language on the surface; even if you’re not familiar with Python, getting up to speed is quick if you’ve used almost any other mainstream programming language. Second, Python has a great community, which results in good documentation and friendly, comprehensive answers on Stack Overflow (fundamental!). Third, also stemming from the great community, there are plenty of useful libraries for Python (both in the “batteries included” standard library and from third parties), which solve almost any problem you might have (including machine learning).

But I heard Python is slow!

Yeah, and it’s true. Python isn’t the fastest language out there: all those handy abstractions come at a cost.

But here’s the trick: libraries can and do offload the expensive calculations to the much more performant (but harder to use) C and C++. For instance, there’s NumPy, a library for numerical computation. Its core is written in C, and it’s fast. Practically every library out there that involves intensive calculations uses it, including almost all of the libraries listed next. So when you read NumPy, think fast.

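Therefore, as long as the heavy number crunching happens inside these libraries rather than in hand-written Python loops, your scripts can run nearly as fast as if you had written them in a lower-level language. So for most machine learning work, speed is not something to worry about.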
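To make the speed argument concrete, here is a minimal sketch that computes the same dot product twice: once with an explicit Python loop and once with NumPy's np.dot. The array size, the seed and the helper name dot_pure_python are arbitrary choices for illustration, and the timings will vary from machine to machine.

```python
import time

import numpy as np


def dot_pure_python(xs, ys):
    """Dot product with an explicit loop; every step runs in the interpreter."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total


# Two arrays of one million random numbers (size chosen arbitrarily).
rng = np.random.default_rng(seed=0)
a = rng.random(1_000_000)
b = rng.random(1_000_000)

# Convert to plain Python lists up front so the conversion is not timed.
a_list, b_list = a.tolist(), b.tolist()

start = time.perf_counter()
slow = dot_pure_python(a_list, b_list)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = float(np.dot(a, b))  # the loop runs inside NumPy's compiled code
numpy_seconds = time.perf_counter() - start

print(f"pure Python loop: {loop_seconds:.4f} s")
print(f"NumPy np.dot:     {numpy_seconds:.4f} s")
```

Both versions give the same result up to floating-point rounding; the only difference is where the loop runs.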

Python libraries to check out

Scikit-learn

Are you starting out in machine learning? Want something that covers everything from feature engineering to training and testing a model? Look no further than scikit-learn! This fantastic piece of free software provides the tools you need for most machine learning and data mining tasks. It’s the de facto standard library for machine learning in Python, recommended for most of the ‘old’ ML algorithms.

This library does both classification and regression, supporting most of the popular algorithms (support vector machines, random forests, naive Bayes, and so on). Its models share a common interface, which makes it easy to switch algorithms and experiment. These ‘older’ algorithms are surprisingly resilient and work very well in a lot of cases.
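As a rough illustration of how easy it is to switch algorithms, the sketch below trains three scikit-learn classifiers on the same data with exactly the same loop. The iris dataset, the 75/25 split and the hyperparameters are assumptions picked for brevity, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# A small built-in dataset: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Every estimator exposes the same fit/score methods,
# so swapping algorithms means changing a single line.
models = {
    "support vector machine": SVC(),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "naive Bayes": GaussianNB(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # train on the training split
    scores[name] = model.score(X_test, y_test)  # accuracy on held-out data
    print(f"{name}: {scores[name]:.2f}")
```

Keeping the models in a dictionary is just one convenient way to compare them side by side.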

But that’s not all! Scikit-learn also does dimensionality reduction, clustering, you name it. It’s also blazingly fast since it runs on NumPy and SciPy (meaning that all the heavy number crunching runs in C instead of Python).
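Dimensionality reduction and clustering follow the same pattern. The sketch below, again using the iris data as an assumed example, projects four features down to two with PCA and then groups the points with k-means; the choice of three clusters is an assumption based on the dataset having three species.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # labels are ignored: this is unsupervised

# Dimensionality reduction: 4 features -> 2 principal components.
X_2d = PCA(n_components=2).fit_transform(X)

# Clustering: assign each point to one of 3 clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_2d)

cluster_sizes = sorted(int((labels == k).sum()) for k in range(3))
print(X.shape, "->", X_2d.shape)
print("cluster sizes:", cluster_sizes)
```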

Check out some examples to see everything this library is capable of, and the tutorials if you want to learn how it works.

NLTK

While not a machine learning library per se, NLTK is a must when working with natural language processing (NLP). It comes with a bundle of datasets and other lexical resources (useful for training models) in addition to libraries for working with text — for functions such as classification, tokenization, stemming, tagging, parsing and more.
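For a taste of what those text functions look like, here is a small sketch of tokenization and stemming. It deliberately uses wordpunct_tokenize and PorterStemmer because, as far as I know, neither needs an extra data download; the sample sentence is made up for illustration.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

text = "Machine learning models are learning from tagged examples."

# Tokenization: split the sentence into words and punctuation marks.
tokens = wordpunct_tokenize(text)

# Stemming: reduce each token to a crude root form.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

print(tokens)
print(stems)
```

Other NLTK features, such as tagging and parsing, generally need additional resources fetched with nltk.download().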

The usefulness of having all of this stuff neatly packaged can’t be overstated. So if you are interested in NLP, check out some tutorials!

Theano

Used widely in research and academia, Theano is the grandfather of all deep learning frameworks. Written in Python, it’s tightly integrated with NumPy. Theano lets you define neural networks as mathematical expressions over multi-dimensional arrays. It then compiles and evaluates those expressions for you, so you don’t have to worry about the actual implementation of the math involved.

It supports offloading calculations to the much faster GPU. Every framework supports this today, but that wasn’t the case when Theano introduced it. The library is very mature at this point and supports a very wide range of operations, which is a great plus when comparing it with similar libraries.

The biggest complaint out there is that the API may be unwieldy for some, making the library hard to use for beginners. However, there are wrappers that ease the pain and make working with Theano simple, such as Keras, Blocks and Lasagne.

Interested in learning about Theano? Check out this Jupyter Notebook tutorial.

TensorFlow

The Google Brain team created TensorFlow for internal use in machine learning applications, and open sourced it in late 2015. They wanted something that could replace their older, closed source machine learning framework, DistBelief, which they said wasn’t flexible enough and was too tightly coupled to their infrastructure to be shared with other researchers around the world.

And so TensorFlow was created. Many consider it an improvement over Theano that learned from the mistakes of the past, claiming more flexibility and a more intuitive API. It can be used not only for research but also in production environments, and it supports huge clusters of GPUs for training. While it doesn’t support as wide a range of operations as Theano, it has better computational graph visualizations.

TensorFlow is very popular nowadays. In fact, if you’ve heard about a single library on this list, it’s probably this one: not a day goes by without a new blog post or paper mentioning TensorFlow being published. This popularity translates into a lot of new users and a lot of tutorials, making it very welcoming to beginners.

Keras

Keras is a fantastic library that provides a high-level API for neural networks and is capable of running on top of either Theano or TensorFlow. It makes harnessing the full power of these complex pieces of software much easier than using them directly. It’s very user-friendly and treats user experience as a top priority, which it achieves through simple APIs and excellent feedback on errors.

It’s also modular, meaning that different modules (neural layers, cost functions, and so on) can be plugged together with few restrictions. This also makes it very easy to extend, since it’s simple to add new modules and connect them with the existing ones.
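The modularity described above can be sketched in a few lines. This example assumes the Keras that ships with a TensorFlow 2 installation (imported as tensorflow.keras); the input size, layer widths, optimizer and loss are arbitrary choices for illustration only.

```python
from tensorflow import keras

# Layers are plugged together like building blocks.
model = keras.Sequential([
    keras.Input(shape=(4,)),                      # 4 input features (assumed)
    keras.layers.Dense(16, activation="relu"),    # hidden layer
    keras.layers.Dense(3, activation="softmax"),  # 3 output classes (assumed)
])

# The cost function and optimizer are separate, swappable modules too.
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

model.summary()
```

Swapping the loss, the optimizer or a layer means changing a single argument, which is what makes experimenting cheap.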

Some people have called Keras so good that it is effectively cheating in machine learning. So if you’re starting out with deep learning, go through the examples and documentation to get a feel for what you can do with it. And if you want to learn, start out with this tutorial and see where you can go from there.

Two similar alternatives are Lasagne and Blocks, but they only run on Theano. So if you tried Keras and are unhappy with it, maybe try out one of these alternatives to see if they work out for you.

PyTorch

Another popular deep learning framework is Torch, which is written in Lua. Facebook open-sourced a Python implementation of Torch called PyTorch, which allows you to conveniently use the same low-level libraries that Torch uses, but from Python instead of Lua.

PyTorch is much easier to debug. One of the biggest differences between Theano/TensorFlow and PyTorch is that the former use symbolic computation while the latter doesn’t. Symbolic computation means that when you code an operation (say, ‘x + y’), it isn’t computed when that line is interpreted. Before it can be executed, it has to be compiled (translated to CUDA or C). This makes debugging harder in Theano/TensorFlow, since an error is much harder to associate with the line of code that caused it. Of course, doing things this way has its advantages, but ease of debugging isn’t one of them.
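The difference is easier to see in a toy example. The sketch below is plain Python, not the actual API of Theano, TensorFlow or PyTorch: the Placeholder and Add classes are hypothetical stand-ins for a symbolic graph, and there is no compilation step. It only shows that, in the symbolic style, ‘x + y’ builds a description of the work, and any error appears later, when the graph is run.

```python
class Placeholder:
    """A named input whose value is only supplied when the graph is run."""

    def __init__(self, name):
        self.name = name

    def __add__(self, other):
        return Add(self, other)

    def evaluate(self, feed):
        return feed[self.name]


class Add:
    """A graph node that records an addition instead of performing it."""

    def __init__(self, left, right):
        self.left, self.right = left, right

    def __add__(self, other):
        return Add(self, other)

    def evaluate(self, feed):
        return self.left.evaluate(feed) + self.right.evaluate(feed)


# Symbolic style: this line computes nothing, it only builds a graph node.
x = Placeholder("x")
y = Placeholder("y")
z = x + y

# The work happens later, when values are fed in.
result = z.evaluate({"x": 2, "y": 3})
print("symbolic result:", result)

# A bad input is only detected inside evaluate(), far from the
# 'z = x + y' line where the operation was written.
symbolic_error = None
try:
    z.evaluate({"x": 2, "y": "three"})
except TypeError as error:
    symbolic_error = error
    print("error raised at run time:", error)

# Eager style: the addition runs on the line where it is written,
# so a failure would point straight at this line.
eager_result = 2 + 3
print("eager result:", eager_result)
```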

If you want to start out with PyTorch, the official tutorials are very friendly to beginners but cover advanced topics as well.

First steps in machine learning?

Alright, you’ve presented me with a lot of alternatives for machine learning libraries in Python. What should I choose? How do I compare these things? Where do I start?

Our Ape Advice™ for beginners is to try not to get bogged down by details. If you’ve never done anything machine learning related, try out scikit-learn. You’ll get an idea of how the cycle of tagging, training and testing works and how a model is developed.

Now, if you want to try out deep learning, start out with Keras — which is widely agreed to be the easiest framework — and see where that takes you. After you have more experience, you will start to see what it is that you actually want from the framework: greater speed, a different API, or maybe something else, and you’ll be able to make a more informed decision.

And even then, there is an endless supply of articles out there comparing Theano, Torch, and TensorFlow. There’s no real way to tell which one is the best. It’s important to take into account that all of them have wide support and are improving constantly, which makes comparisons harder. A six-month-old benchmark may be outdated, and year-old claims that framework X doesn’t support operation Y may no longer be valid.

Finally, if you’re interested in doing machine learning specifically applied to NLP, why not check out MonkeyLearn! Our platform provides a unique UX that makes it super easy to build, train and improve NLP models. You can either use pre-trained models for common use cases (like sentiment analysis, topic detection or keyword extraction) or train custom algorithms using your particular data. Also, you don’t have to worry about the underlying infrastructure or deploying your models: our scalable cloud does this for you. You can start for free and integrate right away with our beautiful API.

Want to learn more?

There are plenty of online resources out there to learn about machine learning! Here are a few:

Final words

So that was a brief intro to machine learning in Python and some of its libraries. The important part is not getting bogged down by details and just trying stuff out. Follow your curiosity, and don’t be afraid to experiment.

Know about a Python library that was left out? Share it in the comments below!

By Bruno Stecanella | August 3rd, 2017
