336 Matching Annotations
  1. Feb 2019
    1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    1. Engineering Challenges
      • Communication: typical key-value stores update one value at a time, but ML algorithms work with matrices, vectors, and tensors and update parts of a matrix or vector at once, so the communication layer can be optimized for these data types.

      • Fault tolerance

    1. To generate a response, the dialogue manager follows a three-step procedure. First, it uses all response models to generate a set of candidate responses. Second, if there exists a priority response in the set of candidate responses (i.e. a response which takes precedence over other responses), this response will be returned by the system. For example, for the question "What is your name?", the response "I am an Alexa Prize socialbot" is a priority response. Third, if there are no priority responses, the response is selected by the model selection policy. For example, the model selection policy may select a response by scoring all candidate responses and picking the highest-scored response.

      How the dialogue manager works:

      • 1. Each response model first generates its own candidate responses.
      • 2. If there is a "priority" response among them, return it directly.
      • 3. If there is no "priority" response, pick one according to the model selection policy (sketched below).
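
      A minimal sketch of this three-step selection logic. The names response_models, is_priority, and selection_policy are hypothetical placeholders for illustration, not the actual MILABOT code.

        def generate_response(dialogue_history, response_models, selection_policy):
            # 1. Every response model proposes a candidate response.
            candidates = [model.respond(dialogue_history) for model in response_models]

            # 2. If any candidate is a priority response, return it immediately.
            priority = [c for c in candidates if c.is_priority]
            if priority:
                return priority[0]

            # 3. Otherwise score all candidates with the model selection policy
            #    and return the highest-scoring one.
            return max(candidates, key=selection_policy.score)
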
    2. There are 22 response models in the system, including retrieval-based neural networks, generation-based neural networks, knowledge base question answering systems and template-based systems. Examples of candidate model responses are shown in Tabl

      A hybrid of retrieval-based, generation-based, knowledge-base QA, and template-based response models.

    3. Early work on dialogue systems (Weizenbaum 1966, Colby 1981, Aust et al. 1995, McGlashan et al. 1992, Simpson & Fraser 1993) were based mainly on states and rules hand-crafted by human experts. Modern dialogue systems typically follow a hybrid architecture, combining hand-crafted states and rules with statistical machine learning algorithms (Suendermann-Oeft et al. 2015, Jurčíček et al. 2014, Bohus et al. 2007, Williams 2011).

      Early systems were mainly based on expert hand-crafted rules and states; modern dialogue systems are more of a hybrid architecture.

    4. We present MILABOT: a deep reinforcement learning chatbot developed by the Montreal Institute for Learning Algorithms (MILA) for the Amazon Alexa Prize competition. MILABOT is capable of conversing with humans on popular small talk topics through both speech and text. The system consists of an ensemble of natural language generation and retrieval models, including template-based models, bag-of-words models, sequence-to-sequence neural network and latent variable neural network models. By applying reinforcement learning to crowdsourced data and real-world user interactions, the system has been trained to select an appropriate response from the models in its ensemble. The system has been evaluated through A/B testing with real-world users, where it performed significantly better than many competing systems. Due to its machine learning architecture, the system is likely to improve with additional data
    1. 10 Exciting Ideas of 2018 in NLP

      The 10 most exciting ideas in NLP in 2018

      • 1 Unsupervised machine translation
      • 2 Pretrained language models
      • 3 Common-sense reasoning datasets
      • 4 Meta-learning
      • 5 Robust unsupervised methods
      • 6 Really understanding what learned representations contain: pretrained language models do show behavior analogous to pretrained image models
      • 7 Successful applications of multi-task learning
      • 8 Combining semi-supervised learning with transfer learning
      • 9 QA and reasoning datasets over large collections of text
      • 10 Inductive bias
    2. Phrase-Based & Neural Unsupervised Machine Translation (EMNLP 2018):  T
    1. What I will call the model (M) which is our previous neural net. It can now be seen as a low-level network. It is sometimes called an optimizee or a learner. The weights of the model are the ■ on the drawings. The optimizer (O) or meta-learner is a higher-level model which is updating the weights of the lower-level network (the model). The weights of the optimizer are the ★ on the drawings

      The learner: the model doing the actual task (e.g. classifying cats vs. dogs).

      The meta-learner / optimizer (O): a model over the learner's parameters, which decides how those parameters get updated (toy sketch below).
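
      A toy sketch of this split (entirely my own construction, not from the post): the learner M is a tiny logistic-regression model whose weights stand in for the ■, and the optimizer O is reduced to a vector of per-parameter step sizes standing in for the ★. A real meta-learner (e.g. an LSTM) would itself be trained in an outer loop, which is omitted here.

        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.normal(size=(100, 2))
        y = (X[:, 0] + X[:, 1] > 0).astype(float)   # toy "cats vs dogs" labels

        w = np.zeros(2)            # learner weights (the ■)
        theta = np.full(2, 0.1)    # optimizer weights (the ★): per-parameter step sizes

        def grad(w):
            p = 1 / (1 + np.exp(-X @ w))            # logistic-regression gradient
            return X.T @ (p - y) / len(y)

        for _ in range(100):
            g = grad(w)
            w -= theta * g         # the optimizer O decides how to update the model M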

    2. From zero to research — An introduction to Meta-learning

      Meta-learning

      what happens when we train a simple neural net to classify images of dogs and cats.

      https://cdn-images-1.medium.com/max/1000/1*T5ppr8fb0chwz0wC7oYPDA.gif

  2. May 2018
    1. These models are often shared globally by all worker nodes, which must frequently access the shared parameters as they perform computation to refine it.

      When training on very large datasets, the traditional workflow requires every node to hold the full model. The problems this creates:

      • Huge network-bandwidth consumption
      • Many algorithms are sequential, so this kind of synchronous training is very inefficient
      • Poor fault tolerance in the distributed setting
  3. Apr 2018
    1. Algorithms Using Map-Reduce

      Where MapReduce algorithms apply:

      • 1 Matrix multiplication, e.g. for PageRank. For M*V: how do you do it if V fits in memory, and how if it does not? (See the sketch after this list.)
      • 2 Relational-algebra operations (SQL):
        • selection
        • projection
        • union, intersection, difference
        • natural join
        • grouping and aggregation

      Window functions handle computations over subsets of rows, but is there really no corresponding function for subsets of columns?
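
      A sketch of the easy case, where the vector v fits in memory at every mapper: each mapper emits (row index, m_ij * v_j) and each reducer sums the partial products for one row. (When v does not fit, the usual trick is to cut M and v into matching stripes and apply the same scheme per stripe.) This is my own toy illustration, not code from the book.

        from collections import defaultdict

        def map_phase(matrix_entries, v):
            # Each mapper emits (row index, partial product m_ij * v_j).
            for i, j, m_ij in matrix_entries:
                yield i, m_ij * v[j]

        def reduce_phase(pairs):
            # Each reducer sums the partial products for one row index.
            sums = defaultdict(float)
            for i, x in pairs:
                sums[i] += x
            return dict(sums)

        M = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
        v = [10.0, 1.0]
        print(reduce_phase(map_phase(M, v)))   # {0: 12.0, 1: 34.0}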

    2. Finding Similar Items
      • 1 Define similarity and transform the problem: text similarity -> shingling; minhashing -> set similarity
      • 2 Too many candidate pairs -> locality-sensitive hashing (see the sketch below)
      • 3 ...
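
      A toy illustration of the minhashing step (my own example, not from the book): the fraction of positions on which two MinHash signatures agree estimates the Jaccard similarity of the underlying shingle sets. LSH would then band these signatures so that only pairs agreeing in some band become candidate pairs.

        import random

        def minhash_signature(s, perms):
            # One signature entry per random permutation: the minimum rank in s.
            return [min(p[x] for x in s) for p in perms]

        universe = list(range(1000))
        random.seed(1)
        perms = []
        for _ in range(100):
            order = universe[:]
            random.shuffle(order)
            perms.append({x: r for r, x in enumerate(order)})

        a = set(range(0, 60))        # two overlapping shingle sets
        b = set(range(30, 90))

        sig_a = minhash_signature(a, perms)
        sig_b = minhash_signature(b, perms)
        estimate = sum(x == y for x, y in zip(sig_a, sig_b)) / len(perms)
        true_jaccard = len(a & b) / len(a | b)
        print(estimate, true_jaccard)   # both should be close to 1/3
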
    3. Algorithms Using Map-Reduce

      MapReduce algorithm applications

      • 1 Matrix multiplication; if the matrix is too large, split it into blocks by rows and columns
      • 2 Relational computation (SQL):
        • selection
        • projection
        • natural join
        • set operations: intersection, difference, union
        • grouping and aggregation
    4. Distributed File Systems

      Distributed storage is the foundation underneath distributed computation. The main failure type to design for is the failure of a single node.

      The answer comes from two directions: 1. storage must be redundant; 2. computation must be divided into tasks.

    5. Bonferroni’s Principle

      The number of purely random occurrences grows as the data grows, so a Bonferroni-style correction is needed.
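
      A quick back-of-the-envelope illustration with made-up numbers: if you check a million candidate "patterns" at a per-test significance of 0.001, you expect on the order of a thousand hits by pure chance, so the per-test threshold has to shrink with the number of tests.

        n_tests = 1_000_000
        alpha = 0.001
        print(n_tests * alpha)      # ~1000 expected false alarms with no correction
        print(alpha / n_tests)      # Bonferroni-corrected per-test threshold: 1e-09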

    6. Computational Approaches to Modeling

      Computational approaches to modeling: 1. summarize the data succinctly and approximately; 2. extract the most prominent features of the data and ignore the rest.

      Summarization: 1. PageRank 2. clustering

      Feature extraction: 1. frequent itemsets 2. similar items

    1. Dynamic Structure

      Dynamic structure

      • 1 Dynamic structure inside a neural network is sometimes called conditional computation: choosing which units actually get computed.
      • 2 The other direction is cascades of models. Two approaches: put a high-throughput model first and a high-accuracy model after it, or train many independent low-capacity models that together implement high capacity.

      Decision trees are the classic cascade. The simplest way to combine them with deep learning is to put a neural net at each node.

      There is also the mixture-of-experts approach, where a gater network selects which expert network to use (sketched below).

      Yet another kind of dynamic structure is a context-dependent routing mechanism, i.e. the attention mechanism.
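
      A minimal sketch of the gater-based mixture-of-experts idea: a gater network produces weights over several experts and the output is their weighted combination. All shapes and "networks" here are toy placeholders of my own, not a reference implementation; a hard (conditional-computation) variant would evaluate only the top-scoring expert.

        import numpy as np

        rng = np.random.default_rng(0)

        def make_expert(W):
            return lambda x: np.tanh(x @ W)          # one tiny "expert network"

        d_in, d_out, n_experts = 4, 3, 5
        experts = [make_expert(rng.normal(size=(d_in, d_out))) for _ in range(n_experts)]
        W_gate = rng.normal(size=(d_in, n_experts))  # the gater's weights

        def mixture(x):
            logits = x @ W_gate
            gate = np.exp(logits - logits.max())
            gate /= gate.sum()                       # softmax over experts
            outputs = np.stack([e(x) for e in experts])   # (n_experts, d_out)
            return gate @ outputs                    # weighted combination

        print(mixture(rng.normal(size=d_in)))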

    2. Specialized Hardware Implementations of Deep Network

      Custom hardware implementations of deep networks. The key is the difference between training-time and inference-time computation: training needs relatively high numerical precision, inference much less.

    3. These large models learn some function f(x), but do so using many more parameters than are necessary for the task.

      Models are large because the amount of data is relatively small; the extra capacity is only needed to cope with that small dataset. So once the large model is known to work, you can generate essentially unlimited samples from it and use them to learn the true f, which has far fewer parameters.

    4. Large-Scale Distributed Implementations

      Data parallelism: shard the data and run copies of the same model on the shards. Model parallelism: multiple devices compute different parts of the model on the same data.

      Data parallelism is the harder of the two, because each SGD gradient step depends on the weights produced by the previous step. To get around this, Bengio's group proposed asynchronous stochastic gradient descent: each core computes its own gradient and applies the update to the shared parameters without any locking, which is very efficient. An individual update can be overwritten, but because gradients are produced at a much higher overall rate, learning ends up faster overall (rough sketch below).

      This shared-memory approach was later generalized into the parameter server.
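
      A rough, lock-free sketch in the spirit of the asynchronous SGD described above (my own toy construction): each worker reads the shared weights, computes a gradient on its own data shard, and writes the update back without any synchronization, so individual updates can be overwritten.

        import threading
        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.normal(size=(1000, 5))
        true_w = np.arange(1.0, 6.0)
        y = X @ true_w + 0.01 * rng.normal(size=1000)

        w = np.zeros(5)             # shared parameters, updated without locks
        lr = 0.05

        def worker(shard):
            global w
            Xs, ys = X[shard], y[shard]
            for _ in range(200):
                g = Xs.T @ (Xs @ w - ys) / len(ys)   # gradient on this shard
                w = w - lr * g                       # unsynchronized update

        shards = np.array_split(np.arange(1000), 4)
        threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        print(w)                    # should end up close to [1, 2, 3, 4, 5]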

    5. GPU Implementations
      • 1 GPUs' original purpose (graphics) gave them high parallelism, high memory bandwidth, and relatively low clock speeds
      • 2 Deep learning only truly took off with the arrival of general-purpose GPUs (GP-GPU)
      • 3 GPU programming is hard; libraries such as pylearn2, Theano and cuda-convnet package the common primitives, as do TensorFlow and Torch
    6. Fast CPU Implementations
      • 1 Fixed-point vs. floating-point: fixed-point arithmetic sometimes gives the bigger speedup
      • 2 Optimize data structures to avoid cache misses
      • 3 Use vector (SIMD) instruction sets
    7. Deep learning is based on the philosophy of connectionism

      A single unit is useless; millions of them working together become powerful.

    1. Selecting Hyperparameters

      Hyperparameter tuning

      There are two main approaches: manual and automatic.

      Manual tuning requires a good understanding of the hyperparameters, the training error, the generalization error, and the available computational resources. Its main goal is to find, within a limited resource budget, an effective model capacity that matches the complexity of the task. Effective capacity is limited by three factors:

      • the representational capacity of the model
      • the optimization algorithm
      • regularization
    2. Determining Whether to Gather More Data

      How to decide whether more data is needed?

      Many people's first instinct is simply to try a lot of models at once, which is also fine.

      • First, check whether the model performs acceptably on the training set. If it performs poorly, the model has not even fit the training data, so focus on increasing model capacity (more layers and units) or on tuning the learning rate.
      • If training-set performance is fine, look at the test set. If that is also fine, no more data is needed. If it is much worse, the model is overfitting and more data is needed.

      How much more data is enough? Plot the relationship between the amount of data and the generalization error; as a rule of thumb, doubling the dataset size between consecutive runs works well (sketch below).
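
      A small sketch of that learning-curve advice, with the training-set size doubling between runs (toy linear-regression data of my own; substitute your own model and error metric):

        import numpy as np

        rng = np.random.default_rng(0)
        d = 20
        true_w = rng.normal(size=d)

        def make_data(n):
            X = rng.normal(size=(n, d))
            return X, X @ true_w + 0.5 * rng.normal(size=n)

        X_test, y_test = make_data(2000)

        n = 50
        while n <= 1600:
            X, y = make_data(n)
            w, *_ = np.linalg.lstsq(X, y, rcond=None)      # fit on n examples
            err = np.mean((X_test @ w - y_test) ** 2)      # generalization error
            print(n, round(err, 3))
            n *= 2                                         # double the data each run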

    3. design process

      The book presents many algorithms and formulas, but in practice the point is applying them well. On how to do that, Andrew Ng has good advice:

      • 1 Decide on the goal: which error metric to use
      • 2 Quickly build a complete end-to-end pipeline and form expectations about how it should perform
      • 3 Figure out where the bottleneck is: the algorithm, the data, or something else
      • 4 Make repeated incremental changes
    4. During day-to-day development of machine learning systems, practitioners need to decide whether to gather more data, increase or decrease model capacity, add or remove regularizing features, improve the optimization of a model, improve approximate inference in a model, or debug the software implementation of the mod

      To apply deep models successfully, you must understand not just the model itself but also how to act on experimental feedback: whether to gather more data, increase or decrease model capacity, remove regularization features, and so on.

  4. Aug 2016
    1. TREE BOOSTING IN A NUTSHELL

      Split finding, leaf-weight computation, and the optimal tree structure are covered, but where is the weight of each individual tree? Or is it not needed? It appears the trees are simply summed in an additive model.

    1. Why do I sometime get a crash/freeze with n_jobs > 1 under OSX or Linux?

      multiprocessing

    1. N = pool.map(partial(func, b=second_arg), a_args)

      pool map partial

    2. My initial thought was to use partial, and as J.F. Sebastian indicated, partial works in this instance in Python >=2.7, so I am posting this, with the caveat that it won't work in 2.6. Also note that in the above code, you're passing the result of harvester(text, case) instead of the function harvester itself. Also, you aren't returning anything; you'll have to return something in order for this to be useful. I'm assuming that text is the variable that should be mapped, while case supplies the mapping function with extra information about the whole sequence. This simply maps each element in case to case[i] + case[0]. That's a bit different from what you did, but I find this example clearer:

        from functools import partial

        def harvester(text, case):
            X = case[0]
            return text + str(X)

        partial_harvester = partial(harvester, case=RAW_DATASET)

        if __name__ == '__main__':
            pool = multiprocessing.Pool(processes=6)
            case_data = RAW_DATASET
            pool.map(partial_harvester, case_data, 1)
            pool.close()
            pool.join()

      python multiprocess
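
      A self-contained, runnable variant of the pattern in the quoted answer (my own toy data): pool.map passes exactly one argument per item, so the fixed extra argument is bound up front with functools.partial.

        from functools import partial
        from multiprocessing import Pool

        def harvester(text, case):
            return text + str(case[0])

        if __name__ == "__main__":
            texts = ["a", "b", "c"]
            with Pool(processes=2) as pool:
                results = pool.map(partial(harvester, case=[42]), texts)
            print(results)    # ['a42', 'b42', 'c42']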

    1. Project Tungsten: Bringing Apache Spark Closer to Bare Metal

      IO is not the core problem; serialization and hashing during shuffle are the CPU bottleneck.

    1. How to read files from resources folder in Scala?

      Reading resource files inside the project in Scala

    1. Copyright (C) 2009, David Beazley, http://www.dabeaz.com. A Curious Course on Coroutines and Concurrency

      python coroutine
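
      The canonical example from that course, reproduced here from memory (details may differ slightly): a generator-based coroutine that is primed with next() and then fed values via send().

        def grep(pattern):
            print("looking for", pattern)
            while True:
                line = yield            # suspend here until a value is sent in
                if pattern in line:
                    print(line)

        g = grep("python")
        next(g)                          # prime the coroutine
        g.send("no match here")
        g.send("python coroutines are neat")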

    1. BUG: DataFrame outer merge changes key columns from int64 to float64

      So the type conversion on outer merge is actually a known bug.

    1. NaN can't be stored in an integer array. This is a known limitation of pandas at the moment; I have been waiting for progress to be made with NA values in NumPy (similar to NAs in R), but it will be at least 6 months to a year before NumPy gets these features, it seems:

      Why a column that starts as int becomes float after a join: it comes down to how missing values (NaN) are handled.

    1. df[['two', 'three']] = df[['two', 'three']].astype(float)

      This is the one that actually works!

    2. You can use pd.to_numeric (introduced in version 0.17) to convert a column or a Series to a numeric type. The function can also be applied over multiple columns of a DataFrame using apply. Importantly, the function also takes an errors key word argument that lets you force not-numeric values to be NaN, or simply ignore columns containing these values.

      pandas change column type
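
      A small illustration of the two conversions discussed above (toy frame of my own): astype(float) for a clean column, pd.to_numeric(..., errors='coerce') when bad values should turn into NaN instead of raising.

        import pandas as pd

        df = pd.DataFrame({"two": ["1", "2"], "three": ["3.5", "oops"]})
        df["two"] = df["two"].astype(float)                        # clean column
        df["three"] = pd.to_numeric(df["three"], errors="coerce")  # 'oops' -> NaN
        print(df.dtypes)
        print(df)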

    1. As I felt similarly confused with .transform operation vs. .apply I found a few answers shedding some light on the issue. This answer for example was very helpful. My takeout so far is that .transform will work (or deal) with Series (columns) in isolation from each other. What this means is that in your last two calls:

        df.groupby('A').transform(lambda x: (x['C'] - x['D']))
        df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())

      You asked .transform to take values from two columns and 'it' actually does not 'see' both of them at the same time (so to speak). transform will look at the dataframe columns one by one and return back a series (or group of series) 'made' of scalars which are repeated len(input_column) times.

      pandas groupby transform vs. apply: the documentation is maddeningly unclear!

    1. transform is not that well documented, but it seems that the way it works is that what the transform function is passed is not the entire group as a dataframe, but a single column of a single group. I don't think it's really meant for what you're trying to do, and your solution with apply is fine. So suppose tips.groupby('smoker').transform(func). There will be two groups, call them group1 and group2. The transform does not call func(group1) and func(group2). Instead, it calls func(group1['total_bill']), then func(group1['tip']), etc., and then func(group2['total_bill']), func(group2['total_bill']). Here

      pandas groupby: the documentation is painfully hard to read!
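
      A tiny demonstration of the difference complained about here (my own toy frame): apply sees each group as a whole DataFrame, so cross-column math works; transform only ever sees one column of one group at a time.

        import pandas as pd

        df = pd.DataFrame({"A": ["x", "x", "y"], "C": [1, 2, 3], "D": [10, 20, 30]})

        # apply: the function receives the whole group, cross-column math is fine.
        print(df.groupby("A")[["C", "D"]].apply(lambda g: (g["C"] - g["D"]).mean()))

        # transform: the function receives a single Series per column per group,
        # so only per-column operations make sense (here: de-meaning within group).
        print(df.groupby("A")[["C", "D"]].transform(lambda s: s - s.mean()))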

  5. Jul 2016
    1. In [1]: %load_ext autoreload In [2]: %autoreload 2 In [3]: from foo import some_function

      python auto reload

    1. http://dev.mysql.com/doc/refman/5.1/en/alter-table.html

        ALTER TABLE tablename MODIFY columnname INTEGER;

      Changing a column's type in MySQL

    1. Dropout is an extremely effective, simple and recently introduced regularization technique by Srivastava et al. in Dropout: A Simple Way to Prevent Neural Networks from Overfitting (pdf) that complements the other methods (L1, L2, maxnorm).

      An extremely simple and effective regularization method

    1. Do Deep Nets Really Need to be Deep? Draft for NIPS 2014 (not camera ready copy)

      Do neural networks really need to be deep?

    1. As an aside, in practice it is often the case that 3-layer neural networks will outperform 2-layer nets, but going even deeper (4,5,6-layer) rarely helps much more.

      In practice, 3-layer nets usually beat 2-layer nets, but going deeper than that rarely helps much more; for convolutional networks, however, deeper generally is better.

    2. Representational power

      Representational power of NNs: a network with a single hidden layer can approximate any continuous function.

    3. N-layer neural network, we do not count the input laye

      The input layer is not counted.

    4. he most common layer type is the fully-connected layer in which neurons between two adjacent layers are fully pairwise connected, but neurons within a single layer share no connections.

      In a fully-connected layer, every neuron in one layer is connected to every neuron in the adjacent layer; neurons within the same layer have no connections.

    5. TLDR: “What neuron type should I use?” Use the ReLU non-linearity, be careful with your learning rates and possibly monitor the fraction of “dead” units in a network. If this concerns you, give Leaky ReLU or Maxout a try. Never use sigmoid. Try tanh, but expect it to work worse than ReLU/Maxout.

      Choosing the neuron type: try ReLU first; if too many units die, try Leaky ReLU or Maxout.

    1. One of the most striking facts about neural networks is that they can compute any function at all.

      Try

  6. homes.cs.washington.edu
    1. A Few Useful Things to Know about Machine Learning

      A very practical overview of machine learning concepts

    1. Intuitive understanding of backpropagation

      Intuitive understanding of backpropagation via "gates" in a circuit

    1. Multiclass SVM loss for the i-th example is then formalized as follows

      Why use the sum of max(sj - syi + ...)? What is the intuition behind this formula? Should it be an absolute value?
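
      For reference, the formula the question refers to (the standard multiclass hinge loss as written in the CS231n notes) is $L_i = \sum_{j \neq y_i} \max(0,\ s_j - s_{y_i} + \Delta)$. It only penalizes classes whose score comes within the margin $\Delta$ of the correct class's score; a score far below $s_{y_i}$ contributes nothing, which is why the difference is not an absolute value.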

    1. Dropout: A Simple Way to Prevent Neural Networks from Overfitting

      Need to read this

  7. inst-fs-iad-prod.inscloudgate.net
    1. There are four key ideas behind ConvNets that take advantage of the properties of natural signals: local connections, shared weights, pooling and the use of many layers.

      The four key ideas behind ConvNets: local connections, shared weights, pooling, and the use of many layers.

    2. half-spaces separated by a hyperplane.

      The limitation of conventional algorithms in image and speech: they need to be insensitive to irrelevant variation while staying sensitive to a few very small but meaningful differences.

    3. Deep learning

      Three of the "big four" of deep learning.

    4. The key aspect of deep learning is that these layers of features are not designed by human engineers: they are learned from data using a general-purpose learning procedure.

      The most important aspect of deep learning: the multiple layers of features are learned from data automatically rather than designed by human engineers.

    5. most practitioners use a procedure called stochastic gradient descent (SGD).

      Stochastic gradient descent; explained very well here.

    6. The chain rule of derivatives tells us how two small effects (that of a small change of x on y, and that of y on z) are composed.

      Oh, so that's how it works!

    7. The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives.

      The backpropagation procedure for computing the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives (tiny numeric example below).
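
      A tiny numeric illustration of "backprop is just the chain rule" (my own example): for z = (x + y) * y, the derivative with respect to x is obtained by multiplying local derivatives along the path.

        x, y = 3.0, 2.0
        q = x + y                 # forward pass: q = 5
        z = q * y                 # forward pass: z = 10

        dz_dq = y                 # local derivative of z = q * y w.r.t. q
        dq_dx = 1.0               # local derivative of q = x + y w.r.t. x
        dz_dx = dz_dq * dq_dx     # chain rule: 2.0
        dz_dy = q * 1.0 + dz_dq * 1.0   # y influences z both directly and through q
        print(dz_dx, dz_dy)       # 2.0 7.0  (check: z = xy + y^2, dz/dy = x + 2y = 7)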

    1. pip install -i http://mirrors.aliyun.com/pypi/simple pymysql

      Using Alibaba Cloud's pip mirror for Python

    1. Clonezilla The Free and Open Source Software for Disk Imaging and Cloning

      Disk imaging and cloning tool

    1. Following reminders from @山丹丹 and @啸王 in the comments, some errors have been corrected and, based on my more recent understanding, some material has been added; further corrections are welcome. First question: why introduce a nonlinear activation function? If no activation function is used (equivalent to the activation f(x) = x), then each layer's output is a linear function of the previous layer's input. It is easy to verify that no matter how many layers the network has, the output is then a linear combination of the input, which is equivalent to having no hidden layers at all; that is just the original Perceptron. For exactly this reason we introduce a nonlinear activation function, so that a deep network becomes meaningful (no longer a linear combination of the input, and able to approximate arbitrary functions). The earliest choices were sigmoid and tanh, whose bounded outputs are easy to feed to the next layer (plus assorted biological justifications).

      Second question: why introduce ReLU? First, with sigmoid-like activations the forward pass involves exponentials and the backward pass involves division, so the computation is heavy; ReLU makes the whole process much cheaper. Second, in deep networks sigmoid easily produces vanishing gradients during backpropagation (near the saturated regions the function changes very slowly, the derivative approaches 0, and information is lost; see the third point in @Haofeng Li's answer), which makes deep networks impossible to train. Third, ReLU sets part of the neurons' outputs to 0, which makes the network sparse, reduces interdependence between parameters, and alleviates overfitting (plus, again, biological justifications). There are also refinements of ReLU such as PReLU and randomized ReLU, which give some gains in training speed or accuracy on certain datasets; see the relevant papers for details.

      One more note: the mainstream practice now is to follow ReLU with batch normalization, to keep every layer's inputs as close as possible to the same distribution [1]. The newest paper [2] finds that, after adding bypass connections, moving where batch normalization is placed gives even better results.

      The virtues of ReLU
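
      A quick numeric look at the vanishing-gradient point above (my own toy numbers): the sigmoid derivative collapses toward 0 in its saturated regions, while the ReLU derivative stays at 1 for all positive inputs and is exactly 0 for negative ones, which is also where the sparsity comes from.

        import numpy as np

        x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

        sig = 1 / (1 + np.exp(-x))
        sig_grad = sig * (1 - sig)            # ~4.5e-05 at |x| = 10
        relu_grad = (x > 0).astype(float)     # 0 or 1

        print(sig_grad)
        print(relu_grad)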