- Feb 2019
-
gitee.com gitee.com
-
nlp.stanford.edu nlp.stanford.edu
-
GloVe: Global Vectors for Word Representation
-
-
www.jmlr.org www.jmlr.org
-
Natural Language Processing (Almost) from Scratch
-
-
-
Short Text Similarity with Word Embeddings
-
-
arxiv.org arxiv.org
-
-
arxiv.org arxiv.org
-
-
gitee.com gitee.com
-
Text Understanding from Scratch
-
-
gitee.com gitee.com
-
Understanding Short Texts
-
-
gitee.com gitee.com
-
gitee.com gitee.com
-
gitee.com gitee.com
-
gitee.com gitee.com
-
arxiv.org arxiv.org
-
Attention Is All You Need
-
-
cs.stanford.edu cs.stanford.edu
-
Distributed Representations of Sentences and Documents - Doc2Vec
-
-
tanthiamhuat.files.wordpress.com tanthiamhuat.files.wordpress.com
-
aclanthology.org aclanthology.org
-
Deep contextualized word representations
-
-
gitee.com gitee.com
-
-
Bag of Tricks for Efficient Text Classification-fasttext
-
-
gist.github.com gist.github.com eliza.py
-
-
arxiv.org arxiv.org
-
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
-
-
-
-
static.googleusercontent.com static.googleusercontent.com
-
stackoverflow.com stackoverflow.com
-
Efficient way to loop over Pandas Dataframe to make dummy variables (1 or 0 input)
dummy encoding
-
-
book.haihome.top book.haihome.top
-
Engineering Challenges
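Communication: general-purpose stores are key-value, with updates at the granularity of a single value. ML algorithms, however, usually work on matrices, vectors, and tensors and update part of a matrix or a vector, so the data types used for communication can be optimized further.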
Fault tolerance
-
-
gitee.com gitee.com
-
gitee.com gitee.com
-
arxiv.org arxiv.org
-
-
gitee.com gitee.com
-
To generate a response, the dialogue manager follows a three-step procedure. First, it uses all response models to generate a set of candidate responses. Second, if there exists a priority response in the set of candidate responses (i.e. a response which takes precedence over other responses), this response will be returned by the system. For example, for the question "What is your name?", the response "I am an Alexa Prize socialbot" is a priority response. Third, if there are no priority responses, the response is selected by the model selection policy. For example, the model selection policy may select a response by scoring all candidate responses and picking the highest-scored response.
How the dialogue manager works:
- 1. Each response model first generates its own candidate responses.
- 2. If the candidates include a "priority" response, return it directly.
- 3. If there is no "priority" response, choose one according to the model selection policy.
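The three steps above can be sketched in a few lines. This is only an illustration, not the MILABOT implementation: the `Candidate` type, the two toy response models, and the length-based scoring function are all hypothetical stand-ins.

```python
from typing import Callable, List, NamedTuple


class Candidate(NamedTuple):
    text: str
    priority: bool = False  # True if this response takes precedence over others


# Hypothetical interface: a response model maps an utterance to candidates.
ResponseModel = Callable[[str], List[Candidate]]


def select_response(
    utterance: str,
    models: List[ResponseModel],
    score: Callable[[str, Candidate], float],
) -> str:
    # Step 1: every response model proposes candidate responses.
    candidates = [c for model in models for c in model(utterance)]
    if not candidates:
        raise ValueError("no response model produced a candidate")
    # Step 2: if a priority response exists, return it immediately.
    for c in candidates:
        if c.priority:
            return c.text
    # Step 3: otherwise the selection policy picks the highest-scored candidate.
    return max(candidates, key=lambda c: score(utterance, c)).text


# Toy response models and a toy scoring policy (assumptions for the demo).
def name_model(utterance: str) -> List[Candidate]:
    if "your name" in utterance.lower():
        return [Candidate("I am an Alexa Prize socialbot", priority=True)]
    return []


def echo_model(utterance: str) -> List[Candidate]:
    return [
        Candidate("Tell me more about that."),
        Candidate("Interesting! Why do you say: " + utterance),
    ]


def length_score(utterance: str, candidate: Candidate) -> float:
    return float(len(candidate.text))


print(select_response("What is your name?", [echo_model, name_model], length_score))
print(select_response("I like tea", [echo_model, name_model], length_score))
```

In the real system the scoring function would be the learned model selection policy; here it is just response length, so that the control flow is easy to follow.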
-
There are 22 response models in the system, including retrieval-based neural networks, generation-based neural networks, knowledge base question answering systems and template-based systems. Examples of candidate model responses are shown in Tabl
A hybrid of retrieval-based, generation-based, knowledge-base question answering, and template-based response models.
-
Early work on dialogue systems (Weizenbaum 1966, Colby 1981, Aust et al. 1995, McGlashan et al. 1992, Simpson & Fraser 1993) were based mainly on states and rules hand-crafted by human experts. Modern dialogue systems typically follow a hybrid architecture, combining hand-crafted states and rules with statistical machine learning algorithms (Suendermann-Oeft et al. 2015, Jurčíček et al. 2014, Bohus et al. 2007, Williams 2011).
Early systems were mainly based on expert-written rules and states. Modern dialogue systems more often use a hybrid architecture.
-
We present MILABOT: a deep reinforcement learning chatbot developed by the Montreal Institute for Learning Algorithms (MILA) for the Amazon Alexa Prize competition. MILABOT is capable of conversing with humans on popular small talk topics through both speech and text. The system consists of an ensemble of natural language generation and retrieval models, including template-based models, bag-of-words models, sequence-to-sequence neural network and latent variable neural network models. By applying reinforcement learning to crowdsourced data and real-world user interactions, the system has been trained to select an appropriate response from the models in its ensemble. The system has been evaluated through A/B testing with real-world users, where it performed significantly better than many competing systems. Due to its machine learning architecture, the system is likely to improve with additional data
-
-
ruder.io ruder.io
-
10 Exciting Ideas of 2018 in NLP
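The 10 most exciting NLP ideas of 2018:
- 1 Unsupervised machine translation
- 2 Pretrained language models
- 3 Common-sense inference datasets
- 4 Meta-learning
- 5 Robust unsupervised methods
- 6 Truly understanding the representations in representation learning: pretrained language models really do show properties similar to those of image models
- 7 Successful applications of multi-task learning
- 8 Combining semi-supervised learning with transfer learning
- 9 QA and reasoning datasets built on large amounts of text
- 10 Inductive bias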
-
Phrase-Based & Neural Unsupervised Machine Translation (EMNLP 2018): T
-
-
-
What I will call the model (M) which is our previous neural net. It can now be seen as a low-level network. It is sometimes called an optimizee or a learner. The weights of the model are the ■ on the drawings. The optimizer (O) or meta-learner is a higher-level model which is updating the weights of the lower-level network (the model). The weights of the optimizer are the ★ on the drawings
Learner: the model that classifies cats vs. dogs.
Meta-learner, or optimizer (O): a model that describes the parameters of the classification model.
-
From zero to research — An introduction to Meta-learning
Meta-learning
what happens when we train a simple neural net to classify images of dogs and cats.
https://cdn-images-1.medium.com/max/1000/1*T5ppr8fb0chwz0wC7oYPDA.gif
-
-
gitee.com gitee.com
-
10 Exciting Ideas of 2018 in NLP
Major NLP events of 2018
-
- May 2018
-
book.haihome.top book.haihome.top
-
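These models are often shared globally by all worker nodes, which must frequently access the shared parameters as they perform computation to refine it.
In the traditional training workflow, distributed training on very large datasets requires every node to have the model. This causes several problems:
- Huge network bandwidth consumption
- Many algorithms are sequential, so this synchronous style of training is inefficient
- Poor fault tolerance in a distributed setting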
-
- Apr 2018
-
book.haihome.top book.haihome.top
-
Algorithms Using Map-Reduce
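Use cases for MapReduce algorithms:
- 1 Matrix multiplication, e.g. PageRank. For M*V, what to do if V fits in memory? And what if it does not?
- 2 Relational-algebra operations (SQL):
- selection,
- projection,
- union, intersection, difference
- natural join
- grouping and aggregation
Window functions handle computation over subsets of rows, but is there really no corresponding function for computation over subsets of columns?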
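For the first question in the list (M*V when V fits in memory), a minimal single-process sketch of the map and reduce phases might look as follows. The function names and the sparse `(i, j, value)` triple layout are assumptions for illustration; a real job would run the mappers and reducers on a cluster.

```python
from collections import defaultdict


def map_phase(matrix_entries, v):
    """Each mapper holds all of v in memory and emits (row, m_ij * v_j)."""
    for i, j, m_ij in matrix_entries:
        yield i, m_ij * v[j]


def reduce_phase(pairs):
    """The reducer sums all partial products that share a row index."""
    sums = defaultdict(float)
    for row, value in pairs:
        sums[row] += value
    return dict(sums)


# Sparse representation of the matrix [[1, 2], [3, 4]] as (i, j, value) triples.
M = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
v = [10.0, 1.0]

result = reduce_phase(map_phase(M, v))
print(result)
```

When V does not fit in memory, the usual answer is to cut the matrix into vertical stripes and the vector into matching horizontal stripes, so that each mapper only needs one stripe of V; the map and reduce functions themselves stay the same.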
-
Finding Similar Items
- 1 Defining similarity and transforming the problem: text similarity -> shingling; minhashing -> set similarity
- 2 Too many pairs -> locality-sensitive hashing
- 3 ..
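The first step above (text -> shingles -> set similarity, approximated by minhash signatures) can be sketched as below. This is a toy version under stated assumptions: character-level shingles with k=3, and a seeded MD5 used as a stand-in for a family of independent hash functions.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of all k-character substrings (k-shingles) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(seed, item):
    # Assumption: seeding MD5 stands in for one hash function per seed.
    digest = hashlib.md5(f"{seed}:{item}".encode("utf-8")).hexdigest()
    return int(digest, 16)


def minhash_signature(items, num_hashes=64):
    """For each hash function, keep the minimum hash value over the set."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


a = shingles("the quick brown fox")
b = shingles("the quick brown dog")
exact = jaccard(a, b)
approx = estimate_similarity(minhash_signature(a), minhash_signature(b))
print(exact, approx)
```

The signature is much smaller than the shingle set, which is what makes the second step (locality-sensitive hashing over bands of the signature) practical when there are too many pairs to compare directly.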
-
Algorithms Using Map-Reduce
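Applications of MapReduce algorithms:
- 1 Matrix multiplication; if the data is too large, cut it into horizontal and vertical stripes
- 2 Relational computation (SQL):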
- selection
- projection
- natural join
- set op, intersection, difference,union
- grouping and aggregation
-
Distributed File Systems
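Distributed storage is the underlying support for distributed computing. The main failure mode considered is the failure of a single node.
The solution has two parts: 1 storage must be redundant; 2 computation must be divided into tasks.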
-
Bonferroni’s Principle
The number of occurrences of purely random events grows as the amount of data grows, so a Bonferroni correction is needed.
-
Computational Approaches to Modeling
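Computational approaches to modeling: 1 summarize the data succinctly and approximately; 2 extract the most dominant features of the data and ignore the rest.
Summarization: 1 PageRank 2 clustering
Feature extraction: 1 frequent itemsets 2 similar items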
-
-
www.deeplearningbook.org www.deeplearningbook.org
-
Dynamic Structure
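Dynamic structure:
- 1 Dynamic structure inside a neural network is sometimes called conditional computation: choosing which units to compute.
- 2 Another direction is cascaded models. There are two approaches: the first puts a high-throughput model first and a high-precision model after it; the other trains a set of independent low-throughput models that achieve high throughput as a whole.
The decision tree is the most typical cascaded model. The simplest way to combine it with deep learning is to put a neural network at each node.
There is also the mixture-of-experts approach, which uses a gater to select among neural networks.
Another kind of dynamic structure is a context-based routing mechanism, i.e. the attention mechanism.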
-
Specialized Hardware Implementations of Deep Network
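Custom hardware for implementing deep networks. The key is a close analysis of how computation differs between training and inference: training needs relatively high precision, while inference needs less.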
-
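These large models learn some function f(x), but do so using many more parameters than are necessary for the task.
The model is large because the amount of data is relatively small, and that large size is only useful with small data. So once the model is confirmed to be good, it can generate unlimited samples, from which the true f with fewer parameters can be learned.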
-
Large-Scale Distributed Implementations
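Data parallelism: shard the data and run the same model on each shard. Model parallelism: different parts of the model, spread over multiple machines, compute on the same data.
Data parallelism is relatively harder, because each SGD gradient computation depends on the weights from the previous step. To address this, Bengio proposed an asynchronous stochastic gradient descent algorithm: each compute core computes its own gradient independently and then applies it to the parameters. The parameters are lock-free, so this is very efficient. A single-step update may be overwritten by another core, but because gradients are produced at a higher overall rate, the learning process is faster overall.
Building on shared memory, this was later developed further into the parameter server.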
-
GPU Implementations
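- 1 The GPU's original purpose determined its high parallelism, high memory bandwidth, and low clock speed
- 2 Deep learning only really took off with the arrival of general-purpose GPUs (GP-GPU)
- 3 GPU programming is hard. Pylearn2, Theano, and cuda-convnet package the common algorithms; so do TensorFlow and Torch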
-
Fast CPU Implementations
- 1 Fixed-point vs. floating-point: the former sometimes brings a larger speedup
- 2 Optimize data structures to avoid cache misses
- 3 Vector instruction sets
-
Deep learning is based on the philosophy of connectionism
A single unit is useless; thousands of them together are powerful.
-
-
www.deeplearningbook.org www.deeplearningbook.org
-
Selecting Hyperparameters
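Hyperparameter tuning
There are two main approaches to hyperparameter tuning: manual and automatic.
Manual tuning requires a good understanding of the hyperparameters, training error, generalization error, and computational resources. Its main goal is to find, under limited resources, an effective model capacity that matches the complexity of the task. There are three main constraints:
- Model capacity
- Optimization algorithm
- Regularization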
-
Determining Whether to Gather More Data
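How do you decide whether more data is needed?
The most direct approach many people take is simply to try many models at once. That is OK too.
- First check whether performance on the training set is acceptable. If it is poor, the model has not learned the training set well, so the priority is to increase model capacity with more layers and units, or to adjust the learning rate.
- If the training set is fine, look at performance on the test set. If that is also fine, no more data is needed. If it is poor, the model is overfitting and more data is needed.
How much more data is enough? Plot the relationship between data size and generalization error. In general, doubling the data size between consecutive experiments works well.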
-
design process
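The book covers many algorithms and formulas, but in practice what matters is applying them well. On how to do that, Andrew has good advice:
- 1 Determine the goal and which error metric to use
- 2 Quickly build a complete end-to-end pipeline, with some expectation of how it should behave
- 3 Identify where the bottleneck is: the algorithm, the data, or something else
- 4 Repeatedly make incremental changes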
-
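During day-to-day development of machine learning systems, practitioners need to decide whether to gather more data, increase or decrease model capacity, add or remove regularizing features, improve the optimization of a model, improve approximate inference in a model, or debug the software implementation of the mod
To apply deep models successfully, it is not enough to understand the models themselves; you also need to know how to act on experimental feedback: whether to gather more data, increase or decrease model capacity, remove regularizing features, and so on.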
-
- Aug 2016
-
books.flexibleplan.com books.flexibleplan.com
-
TREE BOOSTING IN A NUTSHELL
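Split points, node weight computation, optimal tree structure; but the computation of a weight for each individual tree seems to be missing. Or is it not needed? For now it looks like an additive model in which the trees are simply summed linearly.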
-
-
help.ubuntu.com help.ubuntu.com
-
WakeOnLan
Booting a machine over the network; quite fun.
-
-
pandas.pydata.org pandas.pydata.org
-
pandas.get_dummies
dummy encoding
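A minimal sketch of dummy encoding with `pandas.get_dummies`; the toy DataFrame and its column names are made up for illustration.

```python
import pandas as pd

# Toy data: one categorical column and one numeric column.
df = pd.DataFrame({"color": ["red", "green", "red"], "size": [1, 2, 3]})

# Encode only the "color" column; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(df, columns=["color"], dtype=int)
print(dummies)
```

Non-encoded columns are kept as they are, and each category becomes its own `color_<value>` indicator column.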
-
-
scikit-learn.org scikit-learn.org
-
Why do I sometime get a crash/freeze with n_jobs > 1 under OSX or Linux?
multiprocessing
-
-
stackoverflow.com stackoverflow.com
-
Multiprocessing scikit-learn
Parallelism
-
-
www.quora.com www.quora.com
-
Does scikit-learn support parallelism?
Parallelism
-
-
blog.csdn.net blog.csdn.net
-
SQL optimization issues
This one is good.
-
-
askubuntu.com askubuntu.com
-
How can I install Windows after I've installed Ubuntu?
Installing Windows 7 after Ubuntu is already installed
-
-
www.jetbrains.com www.jetbrains.com
-
To modify an existing template
idea code snippet
-
-
stackoverflow.com stackoverflow.com
-
Shortcuts for testing out small code snippets in IntelliJ IDEA?
code snippet
-
-
stackoverflow.com stackoverflow.com
-
What exactly is Spring for?
What is Java Spring?
-
-
stackoverflow.com stackoverflow.com
-
Java Web Application Tutorial for complete beginner [closed]
java web
-
-
stackoverflow.com stackoverflow.com
-
There are two key differences between imap/imap_unordered and map/map_async:
Comparison of the different map variants
-
-
stackoverflow.com stackoverflow.com
-
N = pool.map(partial(func, b=second_arg), a_args)
pool map partial
-
My initial thought was to use partial, and as J.F. Sebastian indicated, partial works in this instance in Python >=2.7, so I am posting this, with the caveat that it won't work in 2.6. Also note that in the above code, you're passing the result of harvester(text, case) instead of the function harvester itself. Also, you aren't returning anything; you'll have to return something in order for this to be useful. I'm assuming that text is the variable that should be mapped, while case supplies the mapping function with extra information about the whole sequence. This simply maps each element in case to case[i] + case[0]. That's a bit different from what you did, but I find this example clearer:
    import multiprocessing
    from functools import partial

    # RAW_DATASET is defined in the original question.
    def harvester(text, case):
        X = case[0]
        return text + str(X)

    # Fix the extra argument so pool.map only has to supply text.
    partial_harvester = partial(harvester, case=RAW_DATASET)

    if __name__ == '__main__':
        pool = multiprocessing.Pool(processes=6)
        case_data = RAW_DATASET
        pool.map(partial_harvester, case_data, 1)
        pool.close()
        pool.join()
python multiprocess
-
-
kayousterhout.github.io kayousterhout.github.io
-
Spark Performance Analysis
Spark performance analysis
-
-
-
Project Tungsten: Bringing Apache Spark Closer to Bare Metal
I/O is not the core problem; serialization and hashing during the shuffle are the CPU bottleneck.
-
-
databricks.com databricks.com
-
Introducing Apache Spark Datasets
Spark Dataset
-
-
stackoverflow.com stackoverflow.com
-
How do I load a file from resource folder?
java resource load
-
-
databricks.com databricks.com
-
An introduction to JSON support in Spark SQL
spark sql
-
-
stackoverflow.com stackoverflow.com
-
scala, guidelines on return type - when prefer seq, iterable, traversable
How to choose among these types
-
-
stackoverflow.com stackoverflow.com
-
How to read files from resources folder in Scala?
Reading resource files inside the project in Scala
-
-
stackoverflow.com stackoverflow.com
-
“Large data” work flows using pandas
How to handle large amounts of data with pandas
-
-
wesmckinney.com wesmckinney.com
-
High performance database joins with pandas DataFrame, more benchmarks
pandas is reasonably fast
-
-
www.dabeaz.com www.dabeaz.com
-
A Curious Course on Coroutines and Concurrency (Copyright (C) 2009, David Beazley, http://www.dabeaz.com)
python coroutine
-
-
www.binpress.com www.binpress.com
-
Simple Python parallelism
Python parallelization
-
-
github.com github.com
-
BUG: DataFrame outer merge changes key columns from int64 to float64
So the type conversion on outer join was actually a bug
-
-
stackoverflow.com stackoverflow.com
-
NaN can't be stored in an integer array. This is a known limitation of pandas at the moment; I have been waiting for progress to be made with NA values in NumPy (similar to NAs in R), but it will be at least 6 months to a year before NumPy gets these features, it seems:
Why a column that was originally int becomes float after a join: how null values are handled
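A small sketch of the effect described above, using made-up frames. Rows without a match get NaN, and since NaN cannot live in a plain int64 column, the column is upcast to float64. The last line assumes a pandas version that has the nullable `Int64` extension dtype, which was added after the quoted answer was written.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2], "a": [10, 20]})
right = pd.DataFrame({"key": [2, 3], "b": [200, 300]})

# Outer join: key 1 has no "b" and key 3 has no "a", so both columns get NaN.
merged = pd.merge(left, right, on="key", how="outer")
print(merged.dtypes)  # "a" and "b" are now float64

# Assumption: nullable integers are available; they keep integer values and use <NA>.
nullable = merged.astype({"a": "Int64", "b": "Int64"})
print(nullable)
```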
-
-
stackoverflow.com stackoverflow.com
-
How to drop rows of Pandas dataframe whose value of certain column is NaN
Handling missing values in pandas
-
-
stackoverflow.com stackoverflow.com
-
df[['two', 'three']] = df[['two', 'three']].astype(float)
This is the one that actually works!
-
You can use pd.to_numeric (introduced in version 0.17) to convert a column or a Series to a numeric type. The function can also be applied over multiple columns of a DataFrame using apply. Importantly, the function also takes an errors key word argument that lets you force not-numeric values to be NaN, or simply ignore columns containing these values.
pandas change column type
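A minimal sketch of the `pd.to_numeric` approach from the quoted answer, on a made-up frame whose column names follow the `astype` example above. `errors="coerce"` turns values that cannot be parsed into NaN.

```python
import pandas as pd

df = pd.DataFrame({"two": ["1.5", "2", "oops"], "three": ["4", "5", "6"]})

# Apply to_numeric column by column; unparseable strings become NaN.
df[["two", "three"]] = df[["two", "three"]].apply(pd.to_numeric, errors="coerce")
print(df.dtypes)
print(df)
```

Unlike `astype(float)`, this does not raise on the bad value; it leaves a NaN that can be handled afterwards.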
-
-
stackoverflow.com stackoverflow.com
-
As I felt similarly confused with .transform operation vs. .apply I found a few answers shedding some light on the issue. This answer for example was very helpful. My takeout so far is that .transform will work (or deal) with Series (columns) in isolation from each other. What this means is that in your last two calls:
    df.groupby('A').transform(lambda x: (x['C'] - x['D']))
    df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())
You asked .transform to take values from two columns and 'it' actually does not 'see' both of them at the same time (so to speak). transform will look at the dataframe columns one by one and return back a series (or group of series) 'made' of scalars which are repeated len(input_column) times.
Python pandas groupby: transform vs. apply! The documentation is really terrible!
-
-
stackoverflow.com stackoverflow.com
-
transform is not that well documented, but it seems that the way it works is that what the transform function is passed is not the entire group as a dataframe, but a single column of a single group. I don't think it's really meant for what you're trying to do, and your solution with apply is fine. So suppose tips.groupby('smoker').transform(func). There will be two groups, call them group1 and group2. The transform does not call func(group1) and func(group2). Instead, it calls func(group1['total_bill']), then func(group1['tip']), etc., and then func(group2['total_bill']), func(group2['tip']). Here
Python pandas groupby: the documentation is really hard to read!
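The difference described in the two answers above can be seen on a small made-up frame (columns A, C, D as in the quoted calls). This is a sketch of the behaviour, not a statement about pandas internals: recent versions may take a faster whole-frame path internally when it gives the same result.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["x", "x", "y", "y"],
    "C": [1.0, 3.0, 10.0, 20.0],
    "D": [0.0, 1.0, 5.0, 5.0],
})

# transform: the function works on one column of one group at a time,
# and the result has the same shape as the input.
demeaned = df.groupby("A")[["C", "D"]].transform(lambda col: col - col.mean())

# apply: the function receives the whole group as a DataFrame,
# so it can combine several columns.
mean_gap = df.groupby("A")[["C", "D"]].apply(lambda g: (g["C"] - g["D"]).mean())

print(demeaned)
print(mean_gap)
```

So anything that needs two columns at once, such as `x['C'] - x['D']`, belongs in `apply`, while per-column operations that must keep the original row index fit `transform`.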
-
- Jul 2016
-
charliepark.org charliepark.org
-
Tags In Jekyll
Jekyll tagging feature
-
-
brizzled.clapper.org brizzled.clapper.org
-
Interesting Synergy trick
Switching one mouse between different hosts and screens
-
-
blog.kaggle.com blog.kaggle.com
-
The Machine Learning Framework
Kaggle blog: a framework with diagrams
-
-
ipython.org ipython.org
-
In [1]: %load_ext autoreload
In [2]: %autoreload 2
In [3]: from foo import some_function
python auto reload
-
-
stackoverflow.com stackoverflow.com
-
import site; site.getsitepackages()
python packages path
-
-
stackoverflow.com stackoverflow.com
-
http://dev.mysql.com/doc/refman/5.1/en/alter-table.html
ALTER TABLE tablename MODIFY columnname INTEGER;
Changing a column's type
-
-
worktile.com worktile.com
-
宜佳
Shop name
-
-
stackoverflow.com stackoverflow.com
-
How do I see what character set a MySQL database / table / column is?
Checking the character encoding
-
-
blog.csdn.net blog.csdn.net
-
Garbled Chinese characters in MySQL
Somewhat useful
-
-
askubuntu.com askubuntu.com
-
Adding timestamps to terminal prompts?
Adding timestamps
-
-
askubuntu.com askubuntu.com
-
How to disable the “unlock your keyring” popup?
A good way to turn this off
-
-
-
A Quick Guide to Using the MySQL APT Repository
SQL install Guide
-
-
cs231n.github.io cs231n.github.io
-
Dropout is an extremely effective, simple and recently introduced regularization technique by Srivastava et al. in Dropout: A Simple Way to Prevent Neural Networks from Overfitting (pdf) that complements the other methods (L1, L2, maxnorm).
A very simple and effective regularization method
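As a rough illustration of the idea, here is a sketch of "inverted" dropout in NumPy. The function name and signature are assumptions for this note; `keep_prob` is the probability of keeping a unit active.

```python
import numpy as np


def dropout_forward(x, keep_prob, rng, train=True):
    """Inverted dropout: zero units at random, rescale the survivors.

    Dividing by keep_prob at training time keeps the expected activation
    unchanged, so nothing has to be rescaled at test time.
    """
    if not train:
        return x
    mask = (rng.random(x.shape) < keep_prob) / keep_prob
    return x * mask


rng = np.random.default_rng(0)
x = np.ones((1000, 100))
out = dropout_forward(x, keep_prob=0.5, rng=rng)
print(out.mean())  # close to 1.0 on average
```

At test time the same function is called with `train=False` and simply returns its input.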
-
-
arxiv.org arxiv.org
-
Do Deep Nets Really Need to be Deep? (Draft for NIPS 2014; not camera-ready copy)
Do neural networks really need to be very deep?
-
-
cs231n.github.io cs231n.github.io
-
As an aside, in practice it is often the case that 3-layer neural networks will outperform 2-layer nets, but going even deeper (4,5,6-layer) rarely helps much more.
In practice 3 layers usually work much better than 2, and more layers add little; for convolutional networks, however, deeper is better.
-
Representational power
Representational power of neural networks: a network with just one hidden layer can approximate any continuous function.
-
N-layer neural network, we do not count the input laye
The input layer is not counted.
-
The most common layer type is the fully-connected layer in which neurons between two adjacent layers are fully pairwise connected, but neurons within a single layer share no connections.
Fully connected neural network: the neurons of two adjacent layers are all pairwise connected, and neurons within the same layer are not connected.
-
TLDR: “What neuron type should I use?” Use the ReLU non-linearity, be careful with your learning rates and possibly monitor the fraction of “dead” units in a network. If this concerns you, give Leaky ReLU or Maxout a try. Never use sigmoid. Try tanh, but expect it to work worse than ReLU/Maxout.
Choosing a neuron type: try ReLU first; if too many units die, try Leaky ReLU or Maxout.
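A small NumPy sketch of the two activations mentioned above, plus one possible way to monitor "dead" units. The `dead_fraction` helper and the 0.01 slope are assumptions for illustration, not something the course notes prescribe.

```python
import numpy as np


def relu(x):
    """ReLU: max(0, x) elementwise."""
    return np.maximum(0.0, x)


def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: a small slope alpha for negative inputs instead of 0."""
    return np.where(x > 0, x, alpha * x)


def dead_fraction(activations):
    """Fraction of units (columns) that output 0 for every example in the batch."""
    return float(np.mean(np.all(activations == 0, axis=0)))


batch = np.array([[-1.0, 2.0], [-3.0, -0.5]])
print(relu(batch))
print(leaky_relu(batch))
print(dead_fraction(relu(batch)))  # the first unit never fires on this batch
```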
-
-
neuralnetworksanddeeplearning.com neuralnetworksanddeeplearning.com
-
One of the most striking facts about neural networks is that they can compute any function at all.
Try
-
-
homes.cs.washington.edu homes.cs.washington.edu dga.ps
-
A Few Useful Things to Know about Machine Learning
A very practical guide to understanding machine learning concepts
-
-
cs231n.github.io cs231n.github.io
-
Intuitive understanding of backpropagation
Intuitive understanding of backpropagation: the "gates" in a circuit
-
-
cs231n.github.io cs231n.github.io
-
Multiclass SVM loss for the i-th example is then formalized as follows
Why use the sum of the max terms over (s_j - s_yi)? What is the intuition behind this formula? Should this be an absolute value?
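One way to approach the question above is to compute the loss on a concrete example. The sketch below follows the course's definition, L_i = sum over j != y_i of max(0, s_j - s_yi + delta); the function name is an assumption. The max(0, ...) only penalizes a wrong class whose score comes within the margin delta of the correct class, whereas an absolute value would also penalize the correct class for winning by a large margin.

```python
import numpy as np


def svm_loss_i(scores, y, delta=1.0):
    """Multiclass SVM (hinge) loss for one example with correct class index y."""
    margins = np.maximum(0.0, scores - scores[y] + delta)
    margins[y] = 0.0  # the correct class does not contribute to the sum
    return float(margins.sum())


scores = np.array([13.0, -7.0, 11.0])
# Class 0 is correct: max(0, -7 - 13 + 10) + max(0, 11 - 13 + 10) = 0 + 8
print(svm_loss_i(scores, y=0, delta=10.0))
```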
-
-
www.cs.toronto.edu www.cs.toronto.edu
-
Dropout: A Simple Way to Prevent Neural Networks from Overfitting
Need to read this
-
-
inst-fs-iad-prod.inscloudgate.net inst-fs-iad-prod.inscloudgate.net
-
There are four key ideas behind ConvNets that take advantage of the properties of natural signals: local connections, shared weights, pooling and the use of many layers.
The four core advantages of convolutional networks: local connections, shared weights, pooling, and the use of many layers.
-
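half-spaces separated by a hyperplane [19].
The limitation of traditional algorithms: in image and speech tasks they must be insensitive to irrelevant variation while being sensitive to a few very small differences.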
-
Deep learning
Three of the "big four"
-
The key aspect of deep learning is that these layers of features are not designed by human engineers: they are learned from data using a general-purpose learning procedure.
The most important aspect of deep learning is that multiple layers of features are learned automatically.
-
most practitioners use a procedure called stochastic gradient descent (SGD).
Stochastic gradient descent; explained very well
-
The chain rule of derivatives tells us how two small effects (that of a small change of x on y, and that of y on z) are composed.
Wow! So that's how it works!!!
-
The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives.
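Using backpropagation to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives.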
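To make the chain-rule statement concrete, here is a tiny sketch with a two-module stack, y = w1 * x and z = tanh(w2 * y). The modules and variable names are chosen only for illustration. The gradient with respect to w1 is the product of the local derivatives dz/dy and dy/dw1, and the script compares it with a finite-difference estimate.

```python
import math


def forward(x, w1, w2):
    y = w1 * x             # first module
    z = math.tanh(w2 * y)  # second module
    return y, z


def backward(x, w1, w2):
    """Gradients of z with respect to w1 and w2 via the chain rule."""
    y, z = forward(x, w1, w2)
    dz_dy = (1.0 - z ** 2) * w2   # local derivative of the second module w.r.t. its input
    dz_dw2 = (1.0 - z ** 2) * y   # local derivative of the second module w.r.t. its weight
    dy_dw1 = x                    # local derivative of the first module w.r.t. its weight
    dz_dw1 = dz_dy * dy_dw1       # chain rule: compose the two effects
    return dz_dw1, dz_dw2


x, w1, w2 = 0.5, 1.5, -0.8
analytic_w1, analytic_w2 = backward(x, w1, w2)

# Finite-difference check of dz/dw1.
eps = 1e-6
numeric_w1 = (forward(x, w1 + eps, w2)[1] - forward(x, w1 - eps, w2)[1]) / (2 * eps)
print(analytic_w1, numeric_w1)
```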
-
-
mysrc.sinaapp.com mysrc.sinaapp.com
-
pip install -i http://mirrors.aliyun.com/pypi/simple pymysql
Aliyun pip mirror for Python
-
-
clonezilla.org clonezilla.org
-
Clonezilla The Free and Open Source Software for Disk Imaging and Cloning
Disk tool
-
-
github.com github.com
-
amyhaber/cnki-downloader
CNKI paper download tool
-
-
en.wikipedia.org en.wikipedia.org
-
Run-length encoding (RLE)
Lossless data compression encoding
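A minimal sketch of run-length encoding and decoding. Representing the encoded form as a list of (symbol, count) pairs is a choice made for this note; real formats pack the runs differently.

```python
from itertools import groupby


def rle_encode(s):
    """Collapse each run of identical characters into a (char, run_length) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]


def rle_decode(pairs):
    """Expand (char, run_length) pairs back into the original string."""
    return "".join(ch * n for ch, n in pairs)


encoded = rle_encode("WWWWBWW")
print(encoded)
print(rle_decode(encoded))
```

Decoding recovers the input exactly, which is what makes the scheme lossless; it only saves space when the data contains long runs.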
-
-
www.zhihu.com www.zhihu.com
-
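Following reminders from @山丹丹 and @啸王 in the comments, I have corrected some errors (shown in italics); thanks to both. I have also added some material based on my recent understanding (shown in italics). If errors remain, corrections are welcome.
Question 1: why introduce a non-linear activation function? Without an activation function (which is equivalent to using f(x) = x), the output of each layer is a linear function of the input from the layer before. It is easy to verify that, no matter how many layers the network has, the output is a linear combination of the input, which is equivalent to having no hidden layer at all; this is the original perceptron. For this reason non-linear functions are used as activation functions, so that deep networks become meaningful (the output is no longer a linear combination of the input and can approximate arbitrary functions). The earliest choices were the sigmoid or tanh functions, whose outputs are bounded and easily serve as input to the next layer (plus some people's biological explanations, and so on).
Question 2: why introduce ReLU? First, with sigmoid-like functions, computing the activation (an exponential) is expensive, and computing the error gradient during backpropagation involves division, which is also relatively expensive; with ReLU the whole process is much cheaper. Second, in deep networks the sigmoid easily leads to vanishing gradients during backpropagation (near saturation the function changes too slowly and the derivative tends to 0, which loses information; see point 3 of @Haofeng Li's answer), so the deep network cannot be trained. Third, ReLU sets the output of some neurons to 0, which makes the network sparse, reduces the interdependence of parameters, and mitigates overfitting (plus some people's biological explanations, and so on).
There are also improvements on ReLU, such as PReLU and random ReLU, which bring some gains in training speed or accuracy on different datasets; see the relevant papers for details. One more remark: the current mainstream practice is to add a batch normalization step after ReLU, to keep the input of each layer as close as possible to the same distribution [1]. In the latest paper [2], after adding bypass connections, the authors found that changing the position of batch normalization gives better results. Have a look if you are interested.
Why ReLU is good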
-