A learning path for word2vec

飞林沙 2013-08-27 06:01:46
赵贾森
2013-08-27 17:18:49 赵贾森

Your diary post from today is already in the top ten Baidu results for word2vec. Congratulations.

holys
2013-08-29 13:24:33 holys (hit)

It's up to second place in the Baidu results.

横刀天笑
2013-08-30 14:30:27 横刀天笑 (Any good books to recommend lately?)

Wow, I totally worship you.

凭栏望北斗
2013-09-01 15:49:17 凭栏望北斗 (What stays on the mind will echo back.)

... Looks seriously impressive ...

ZDD
2013-09-03 17:47:37 ZDD

I love Douban more and more.

xie41
2013-09-05 15:24:54 xie41

Ah, I still can't make sense of it.

xie41
2013-09-05 17:25:55 xie41

Could we get a more detailed walkthrough? That paper barely explains anything.

飞林沙
2013-09-05 17:33:01 飞林沙 (The way this account rants has been commented out)
> Could we get a more detailed walkthrough? That paper barely explains anything. (xie41)

I'm just sketching a path~ Everything beyond that is mathematical derivation; follow my outline and work through the papers one by one~

xie41
2013-09-06 11:11:17 xie41

The paper in 5 never spells out what its optimization objective actually is, and the model as a whole isn't explained clearly. I've read the papers in 1-2 and feel I understand them fairly well, so is the part I'm missing in the papers in 3 and 4? But I skimmed 3 and 4 and they don't seem all that related to the model in 5.

飞林沙
2013-09-06 11:20:56 飞林沙 (The way this account rants has been commented out)
> The paper in 5 never spells out what its optimization objective actually is, and the model as a whole isn't explained clearly. I've read the papers in 1-2 ... (xie41)

If you look at the code you'll find that 2+3 is exactly 5's optimization objective~

xie41
2013-09-06 12:33:52 xie41
> If you look at the code you'll find that 2+3 is exactly 5's optimization objective~ (飞林沙)

OK, thanks.

air_bob
2013-09-13 10:59:17 air_bob

The power used for unigram sampling is 0.75. OP, is there anything special about this 0.75? Why use 0.75, and is there a related paper? Thanks~ @飞林沙

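As far as I can tell from the released word2vec.c, negative samples are drawn from the unigram distribution raised to the 0.75 power via a large precomputed lookup table that is then indexed uniformly. A minimal sketch of that idea follows; the function and variable names here are my own, not the original source:

    /* Sketch: a sampling table over the unigram counts raised to the 0.75
     * power, in the spirit of word2vec's negative sampling.
     * word_counts, vocab_size and TABLE_SIZE are illustrative assumptions. */
    #include <stdlib.h>
    #include <math.h>

    #define TABLE_SIZE 100000000               /* large table approximating the distribution */

    int *table;

    void init_unigram_table(const long long *word_counts, int vocab_size) {
      const double power = 0.75;
      double norm = 0.0, cum;
      table = (int *)malloc(TABLE_SIZE * sizeof(int));
      for (int v = 0; v < vocab_size; v++)     /* normalizer: sum of count^0.75 */
        norm += pow((double)word_counts[v], power);
      int w = 0;
      cum = pow((double)word_counts[w], power) / norm;
      for (long long a = 0; a < TABLE_SIZE; a++) {
        table[a] = w;                          /* word w fills a slice of the table */
        if ((double)a / TABLE_SIZE > cum && w < vocab_size - 1) {
          w++;                                 /* slice size proportional to count^0.75 */
          cum += pow((double)word_counts[w], power) / norm;
        }
      }
    }

To my knowledge the 3/4 exponent was an empirical choice: the negative-sampling paper ("Distributed Representations of Words and Phrases and their Compositionality", mentioned further down this thread) reports that the unigram distribution raised to the 3/4 power worked noticeably better than either the raw unigram or the uniform distribution. It flattens the distribution, so rare words get sampled somewhat more often than their raw frequency alone would allow.
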
胡小姐不姓胡
2013-09-19 03:32:44 胡小姐不姓胡 (A humble girl with a big heart)

So not going to college really does mean no future = = ...

呵呵
2013-09-19 11:52:39 呵呵
> So not going to college really does mean no future = = ... (胡小姐不姓胡)

ehehhehe

呵呵
2013-09-19 11:52:56 呵呵

I'm here to learn WORD.

hillbird
2013-09-22 19:18:15 hillbird

Ranked second on Google.

苹果籽
2013-09-22 20:54:47 苹果籽 (If the loss outweighs the gain, then it all ends here!)

Your left and right brain must have developed evenly, heh..

huangguandxf
2013-09-22 21:20:44 huangguandxf

g = (1 - vocab[word].code[d] - f) * alpha; how is this gradient derived? Can anyone explain it?

huangguandxf
2013-09-27 16:56:31 huangguandxf

The loss function in the code is cross-entropy, not the squared error the OP described. Everything else in the post is right. Thanks, OP!

阑珊
2013-09-27 20:19:23 阑珊

Thinking back, I'm really grateful to my teacher back then; that course had an open-book exam... Neural networks are something I genuinely can't get my head around.

[已注销]
2013-10-04 17:05:55 [已注销]

Marking this. I only went through LDA a few months ago, and my understanding is still shallow.

michael
2013-10-10 11:55:22 michael

I had some questions about the formulas while studying the word2vec code and found the answer on the tool's discussion group; others may find it useful too.
https://groups.google.com/forum/#!topic/word2vec-toolkit/KcT1kpBmJnU

On Monday, September 23, 2013 11:28:52 PM UTC-4, iamhua...@gmail.com wrote:
> // 'g' is the gradient multiplied by the learning rate
> g = (1 - vocab[word].code[d] - f) * alpha;
> Could you show me how to get it?
>
> My question is:
> 1. Why is the desired_output 1 - vocab[word].code[d]? Why not vocab[word].code[d]?
This looks like an arbitrary decision to me. If you use the original binary code digit, you would get similar parameters out, just negated.

> 2. We use gradient descent to update syn1[], so we should take the derivative of the loss function with respect to syn1[]. So we should multiply g by f*(1-f) (f*(1-f) is the derivative of the sigmoid function).
The likelihood function involves a product of sigmoids, one for each binary "decision point" along the Huffman tree. Since the gradient of products is usually difficult to work with, it's typical to take gradients of the log of the likelihood, which is a summation of log-sigmoids.

If we call the binary code digit x, the parameters for a context word phi, and the parameters for an internal node in the Huffman tree theta, then the log of the sigmoid is

log(L) = (1-x) * phi^T theta - log(1 + exp(phi^T theta)).

The derivative of this expression, repeated over the full path, will produce the gradient updates in the code.

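Spelling out the last step of the quoted answer (this is my own reading written out, not text from the original post): with s = phi^T theta and f = sigmoid(s), the per-node term above is

    log(L) = (1 - x) * s - log(1 + exp(s))

and its derivative with respect to s is

    d log(L) / ds = (1 - x) - exp(s) / (1 + exp(s)) = (1 - x) - f

With x = vocab[word].code[d], multiplying by the learning rate gives exactly g = (1 - vocab[word].code[d] - f) * alpha from the code. The chain rule then scales this single factor by the node vector (for the error propagated back to the word vectors) or by the hidden activation (for the syn1 update), so no extra f*(1-f) term appears: the sigmoid derivative cancels once you differentiate the log of the sigmoid rather than the sigmoid itself.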

[已注销]
2013-10-10 14:32:41 [已注销]

I notice a lot of people in here without avatars.

迎客松
2013-10-21 22:23:47 迎客松

word2vec is honestly a pain; these papers are so scattered.

迎客松
2013-10-21 22:26:19 迎客松

Is the reason there are so many papers that they add one little trick and publish a paper, then add another little trick and publish another? The models are all pretty much the same.

香神无涯
2013-10-28 14:21:37 香神无涯

It's ranked first in the Baidu results now.

Adam
2013-12-09 11:48:53 Adam (Vampire~)

So impressive! I found that my learning path is the same as the OP's. Even though I've only had a rough brush with BP neural networks and the HDP/LDA topic models, in order to learn word2vec I tracked down these same papers in the same order and plowed through them. Then I happened to search my way to the OP's Douban post. Great minds really do think alike!

kalviny
2014-02-11 09:47:06 kalviny (Successfully evolved into a standard science-and-engineering guy)

Do spend a little time on LaTeX though...

安眠
2014-02-11 22:59:30 安眠

Awesome.

alisy_zhu
2014-02-24 17:51:10 alisy_zhu

Amazing, a big thumbs up!

l_sy@Echo
2014-03-05 10:03:01 l_sy@Echo

Seriously strong work, big props.

奋斗的北漂人
2014-03-15 21:39:48 奋斗的北漂人

Truly impressive!

langlanglofa
2014-03-23 09:59:23 langlanglofa

OP, could you upload your modified word2vec project code!!!!

Sharon
2014-04-18 08:32:54 Sharon (Later on)

So impressive... really, so impressive... I don't even know how to go about learning this.

sharp
2014-07-10 13:54:51 sharp (Be the best version of yourself)

Mark. Liked.

jason
2015-04-25 12:12:52 jason

Marking this; as a junior-college grad I find it really hard to follow...

jinxiao晨
2015-11-25 21:11:15 jinxiao晨

A question about negative sampling: for a given word W, does the generated set of negative samples contain only one word?

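For reference, in the released word2vec.c each positive (context, target) pair is trained against one true word plus several sampled words (the -negative option; values around 5-10 are typical), not a single negative. A rough sketch of that loop, reusing the hypothetical table from the sketch earlier in the thread:

    /* Sketch: drawing `negative` sampled words per positive target, in the
     * spirit of word2vec.c's negative-sampling loop. The table, TABLE_SIZE
     * and the helper name are assumptions carried over from the earlier sketch. */
    void train_pair(int word, int negative, const int *table,
                    unsigned long long *next_random) {
      for (int d = 0; d < negative + 1; d++) {
        int target, label;
        if (d == 0) {                     /* d = 0: the true (positive) word */
          target = word;
          label = 1;
        } else {                          /* d > 0: a word drawn from the 0.75-power table */
          *next_random = *next_random * 25214903917ULL + 11;
          target = table[(*next_random >> 16) % TABLE_SIZE];
          if (target == word) continue;   /* skip accidental hits on the true word */
          label = 0;
        }
        /* for each (target, label): f = sigmoid(v_context . v'_target),
         * g = (label - f) * alpha, then update both vectors */
        (void)target; (void)label;        /* placeholders in this sketch */
      }
    }

So for a given word W the objective sums one positive term and `negative` independent negative terms; each negative sample is a single word, but several of them are drawn for every positive pair.
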
鹤啸九天
2015-12-29 18:19:05 鹤啸九天

I don't really get it.

疯人院_
2016-12-29 21:08:08 疯人院_

Shouldn't this list also include "Distributed Representations of Words and Phrases and their Compositionality"?

