Baidu Open-Sources AI Software Code for All Practitioners to Share

清新de花 · Posted 2016-1-16 10:39:55

January 15, 2016, 18:09:08. Source: Xinhua Tech

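According to a January 15 announcement from Baidu's Silicon Valley AI Lab (SVAIL), Baidu has open-sourced Warp-CTC, a key piece of its artificial intelligence (AI) software, and published the core code. For researchers, the software can be applied to supervised learning problems that map an input sequence to an output sequence, such as speech recognition.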

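Warp-CTC is an improved implementation of a deep learning algorithm that Baidu developed earlier specifically to run faster on the latest computer chips. SVAIL has uploaded the Warp-CTC C library to GitHub and encourages developers to try the code. Baidu says the code is open to all practitioners.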

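The CTC (Connectionist Temporal Classification) method dates back to 2006, when it was described in a paper from the Swiss AI lab IDSIA. CTC combines several different neural network designs in order to handle imperfect datasets. Baidu built Warp-CTC on this foundation to improve its speech recognition capability.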

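According to Baidu, SVAIL engineers developed Warp-CTC while building an end-to-end speech recognition system, with the goal of improving the scalability of models trained with CTC. "We found that the CTC implementations currently available generally required significantly more memory and/or were tens to hundreds of times slower."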

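Baidu says it hopes that open-sourcing the code will make end-to-end deep learning simpler and faster, speed up researchers' work, and so contribute to progress in the field of machine learning.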

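Part of the code was reportedly used to develop Deep Speech 2, a powerful deep speech recognition system. For some short sentences, the system recognizes speech more accurately than most humans. The technology gives Baidu's hundreds of millions of users better access to its services, especially on mobile. Typing Chinese characters on a smartphone is relatively cumbersome, and many people in China are already used to sending text messages or searching the web by voice.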

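Deep learning lets computers perform a range of "brain-like" learning tasks, such as accurately transcribing speech or recognizing objects in images. In other words, audio of particular words or images of particular objects are fed into a large simulated neural network, and over time the network "learns" to recognize almost any new example.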

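Many technology companies are racing to release their deep learning code for free. The aim is to encourage researchers and entrepreneurs to build machine learning systems that are compatible with each company's technology, and ultimately to strengthen that company's ecosystem. Open source draws more interest in and enthusiasm for innovation, which puts the related technologies into a virtuous cycle of steady progress.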

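Google and Facebook both open-sourced related software platforms not long ago. With Baidu now open-sourcing in the AI field as well, knowledge sharing should spark more innovation, give developers richer ways to learn the technology, and help them tailor their development work to their own needs.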

qiansi506 · Posted 2016-2-22 16:51:15

I have to hand it to you, the original poster really knows their stuff.

清新de花 · Posted 2016-1-21 08:42:29

Ha, the senior moderator is the real expert here, going straight to the source code.

morinson · Posted 2016-1-16 11:48:58

warp-ctc
A fast parallel implementation of CTC, on both CPU and GPU.

Introduction
Connectionist Temporal Classification is a loss function useful for performing supervised learning on sequence data, without needing an alignment between input data and labels. For example, CTC can be used to train end-to-end systems for speech recognition, which is how we have been using it at Baidu's Silicon Valley AI Lab.

The illustration in the original README shows CTC computing the probability of an output sequence "THE CAT ", as a sum over all possible alignments of input sequences that could map to "THE CAT ", taking into account that labels may be duplicated because they may stretch over several time steps of the input data (represented by the spectrogram at the bottom of the image). Computing the sum of all such probabilities explicitly would be prohibitively costly due to the combinatorics involved, but CTC uses dynamic programming to dramatically reduce the complexity of the computation. Because CTC is a differentiable function, it can be used during standard SGD training of deep neural networks.
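
The dynamic programming idea in the paragraph above can be sketched in a few lines of Python. This is a minimal illustration of the textbook CTC forward recursion, not the warp-ctc code: the function names, the choice of index 0 for the blank symbol, and the toy probabilities are all assumptions made for the example. It computes the same probability twice, once by enumerating every alignment and once with the recursion.

```python
import itertools
import math

BLANK = 0  # assumption for this sketch: symbol 0 is the CTC blank


def collapse(path):
    """Map a frame-level path to labels: merge repeated symbols, then drop blanks."""
    merged = [symbol for symbol, _ in itertools.groupby(path)]
    return tuple(s for s in merged if s != BLANK)


def brute_force_prob(probs, labels):
    """Sum the probability of every alignment that collapses to `labels` (O(A^T))."""
    num_frames, alphabet = len(probs), len(probs[0])
    total = 0.0
    for path in itertools.product(range(alphabet), repeat=num_frames):
        if collapse(path) == tuple(labels):
            total += math.prod(probs[t][s] for t, s in enumerate(path))
    return total


def forward_prob(probs, labels):
    """Same quantity via the forward recursion, O(T * L) work."""
    ext = [BLANK]
    for label in labels:
        ext += [label, BLANK]  # blank-interleaved labels, length 2L + 1
    size = len(ext)
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            acc = alpha[s]
            if s >= 1:
                acc += alpha[s - 1]
            # A blank may be skipped only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                acc += alpha[s - 2]
            new[s] = acc * probs[t][ext[s]]
        alpha = new
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


# Toy per-frame distributions over {blank, 1, 2} for T = 4 frames (made-up values).
toy_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
    [0.5, 0.25, 0.25],
]
for toy_labels in ([1, 2], [1, 1], [2]):
    print(toy_labels, brute_force_prob(toy_probs, toy_labels), forward_prob(toy_probs, toy_labels))
```

The brute-force version visits 3^4 = 81 paths here, which is harmless, but the count grows exponentially with the number of frames, while the recursion only keeps one row of 2L + 1 partial sums per time step.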
In our lab, we focus on scaling up recurrent neural networks, and CTC loss is an important component. To make our system efficient, we parallelized the CTC algorithm, as described in this paper. This project contains our high performance CPU and CUDA versions of the CTC loss, along with bindings for Torch. The library provides a simple C interface, so that it is easy to integrate into deep learning frameworks.
This implementation has improved training scalability beyond the performance improvement from a faster parallel CTC implementation. For GPU-focused training pipelines, the ability to keep all data local to GPU memory allows us to spend interconnect bandwidth on increased data parallelism.

Performance
Our CTC implementation is efficient compared with many of the other publicly available implementations. It is also written to be as numerically stable as possible. The algorithm is numerically sensitive, and we have observed catastrophic underflow even in double precision with the standard calculation: the result of dividing two numbers on the order of 1e-324, which should have been approximately one, instead became infinity when the denominator underflowed to 0. By performing the calculation in log space instead, it is numerically stable even in single-precision floating point, at the cost of significantly more expensive operations. Instead of one machine instruction, addition requires the evaluation of multiple transcendental functions. Because of this, the speed of CTC implementations can only be fairly compared if they are both performing the calculation the same way.
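
To make the log-space point concrete, here is a small Python sketch. It is only an illustration of the general technique, not code from warp-ctc: the helper name `log_add` and the example values are assumptions. It shows two probabilities that underflow to zero in double precision, and how their ratio and sum remain well defined when only their logarithms are stored.

```python
import math


def log_add(log_a, log_b):
    """Return log(exp(log_a) + exp(log_b)) without leaving log space."""
    if log_a == -math.inf:
        return log_b
    if log_b == -math.inf:
        return log_a
    hi, lo = max(log_a, log_b), min(log_a, log_b)
    # One exp and one log1p: the transcendental calls that replace a single add.
    return hi + math.log1p(math.exp(lo - hi))


# Two probabilities far below the smallest positive double (illustrative values).
log_p = -800.0
log_q = -800.0 + math.log(2.0)  # q is twice p

# Linear space: both values underflow to 0.0, so q / p cannot be computed.
p, q = math.exp(log_p), math.exp(log_q)
linear_underflowed = (p == 0.0 and q == 0.0)

# Log space: division becomes subtraction, and the ratio of 2 is recovered.
ratio = math.exp(log_q - log_p)

# Log space: addition goes through log_add; log(p + q) = log(3p) = log_p + log(3).
log_sum = log_add(log_p, log_q)

print(linear_underflowed, ratio, log_sum - log_p)
```

The cost is visible in `log_add`: what would be one floating-point addition in linear space becomes a comparison plus calls to `exp` and `log1p`, which is why implementations are only comparable when they do the calculation the same way.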
We compare our performance with Eesen, a CTC implementation built on Theano, and a Cython CPU-only implementation, Stanford-CTC. We benchmark the Theano implementation operating on 32-bit floating-point numbers and doing the calculation in log space, in order to match the other implementations we compare against. Stanford-CTC was modified to perform the calculation in log space, as it did not support it natively. It also does not support minibatches larger than 1, so it would require an awkward memory layout to use in a real training pipeline; we assume a linear increase in cost with minibatch size.
We show results on two problem sizes relevant to our English and Mandarin end-to-end models, respectively, where T represents the number of timesteps in the input to CTC, L represents the length of the labels for each example, and A represents the alphabet size.
On the GPU, our performance at a minibatch of 64 examples ranges from 7x faster to 155x faster than Eesen, and 46x to 68x faster than the Theano implementation.

GPU Performance
Benchmarked on a single NVIDIA Titan X GPU.
T=150, L=40, A=28	warp-ctc	Eesen	Theano
N=1	3.1 ms	0.5 ms	67 ms
N=16	3.2 ms	6 ms	94 ms
N=32	3.2 ms	12 ms	119 ms
N=64	3.3 ms	24 ms	153 ms
N=128	3.5 ms	49 ms	231 ms

T=150, L=20, A=5000	warp-ctc	Eesen	Theano
N=1	7 ms	40 ms	120 ms
N=16	9 ms	619 ms	385 ms
N=32	11 ms	1238 ms	665 ms
N=64	16 ms	2475 ms	1100 ms
N=128	23 ms	4950 ms	2100 ms
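
The speedup ranges quoted for a minibatch of 64 can be rechecked from the two GPU tables above. The short Python sketch below copies the N=64 rows and divides each competitor's time by the warp-ctc time. The dictionary layout and the function name `speedups` are just choices made for this example.

```python
# Timings in milliseconds at minibatch size N=64, copied from the GPU tables above.
gpu_ms_n64 = {
    "T=150, L=40, A=28": {"warp-ctc": 3.3, "Eesen": 24.0, "Theano": 153.0},
    "T=150, L=20, A=5000": {"warp-ctc": 16.0, "Eesen": 2475.0, "Theano": 1100.0},
}


def speedups(row):
    """Return how many times slower each other implementation is than warp-ctc."""
    base = row["warp-ctc"]
    return {name: ms / base for name, ms in row.items() if name != "warp-ctc"}


for problem, row in gpu_ms_n64.items():
    for name, factor in speedups(row).items():
        print(f"{problem}: warp-ctc is {factor:.1f}x faster than {name}")
```

The Eesen ratios come out at roughly 7x and 155x and the Theano ratios at roughly 46x and 69x, in line with the ranges stated in the text once rounding is taken into account.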

CPU Performance
Benchmarked on a dual-socket machine with two Intel E5-2660 v3 processors - warp-ctc used 40 threads to maximally take advantage of the CPU resources. Eesen doesn't provide a CPU implementation. We noticed that the Theano implementation was not parallelizing computation across multiple threads. Stanford-CTC provides no mechanism for parallelization across threads.
T=150, L=40, A=28	warp-ctc	Stanford-CTC	Theano
N=1	2.6 ms	13 ms	15 ms
N=16	3.4 ms	208 ms	180 ms
N=32	3.9 ms	416 ms	375 ms
N=64	6.6 ms	832 ms	700 ms
N=128	12.2 ms	1684 ms	1340 ms

T=150, L=20, A=5000	warp-ctc	Stanford-CTC	Theano
N=1	21 ms	31 ms	850 ms
N=16	37 ms	496 ms	10800 ms
N=32	54 ms	992 ms	22000 ms
N=64	101 ms	1984 ms	42000 ms
N=128	184 ms	3968 ms	86000 ms

Interface
The interface is in include/ctc.h. It supports CPU or GPU execution, and you can specify OpenMP parallelism if running on the CPU, or the CUDA stream if running on the GPU. We took care to ensure that the library does not perform memory allocation internally, in order to avoid synchronizations and overheads caused by memory allocation.

Compilation
warp-ctc has been tested on Ubuntu 14.04 and OSX 10.10. Windows is not supported at this time.
First get the code:
git clone
cd warp-ctc
Create a build directory:
mkdir build
cd build
If you have a non-standard CUDA install, export CUDA_BIN_PATH=/path_to_cuda so that CMake detects CUDA. To ensure Torch is detected, make sure th is in $PATH.
Run cmake and build:
cmake ../
make
The C library and Torch shared libraries should now be built along with test executables. If CUDA was detected, then test_gpu will be built; test_cpu will always be built.

Tests
To run the tests, make sure the CUDA libraries are in LD_LIBRARY_PATH (DYLD_LIBRARY_PATH for OSX).
The Torch tests must be run from the torch_binding/tests/ directory.

Torch Installation
luarocks make torch_binding/rocks/warp-ctc-scm-1.rockspec
You can also install without cloning the repository using
luarocks install http://raw.githubusercontent.com ... -ctc-scm-1.rockspec
There is a Torch CTC tutorial.

Contributing
We welcome improvements from the community. Please feel free to submit pull requests.

Known Issues / Limitations
The CUDA implementation requires a device of at least compute capability 3.0.
The CUDA implementation supports a maximum label length of 639 (timesteps are unlimited).

morinson · Posted 2016-1-16 11:45:37

I searched GitHub and the project's source code really is there.

Anyone interested can take a look: https://github.com/baidu-research/warp-ctc.git
