Word Embeddings: Encoding Lexical Semantics

Word Embeddings in PyTorch

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)  # 2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)
print(hello_embed)

Out:

tensor([[ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519]],
       grad_fn=<EmbeddingBackward>)
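
The lookup above fetches one word at a time. As a minimal sketch (the names `batch`, `batch_embeds` and `same_rows` are illustrative and not part of the original post), several indices can be looked up in one call, and the result is simply the matching rows of the embedding weight matrix:

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(len(word_to_ix), 5)  # 2 words in vocab, 5 dimensional embeddings

# Look up both words at once; the result has one row per index.
batch = torch.tensor([word_to_ix["hello"], word_to_ix["world"]], dtype=torch.long)
batch_embeds = embeds(batch)
print(batch_embeds.shape)  # torch.Size([2, 5])

# The lookup is plain row selection from the weight matrix.
same_rows = torch.equal(batch_embeds, embeds.weight[batch])
print(same_rows)  # True
```

Because the lookup is row selection, gradients only flow into the rows that were actually indexed during a training step.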

An Example: N-Gram Language Modeling

CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples.  Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]

vocab = set(test_sentence)  # a set keeps only the distinct words
word_to_ix = {word: i for i, word in enumerate(vocab)}

class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

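    def forward(self, inputs):
        # Look up the context embeddings and flatten them into one row vector
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        # Log-probabilities over the vocabulary, as expected by nn.NLLLoss
        log_probs = F.log_softmax(out, dim=1)
        return log_probs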

losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

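for epoch in range(10):
    total_loss = 0
    for context, target in trigrams:
        # Step 1. Turn the context words into integer indices wrapped in a tensor
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 2. PyTorch accumulates gradients, so zero them before each instance
        model.zero_grad()

        # Step 3. Forward pass: log-probabilities over the next word
        log_probs = model(context_idxs)

        # Step 4. Compute the loss against the index of the target word
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Backward pass and parameter update
        loss.backward()
        optimizer.step()

        # Convert the 0-dim loss tensor to a Python number before accumulating
        total_loss += loss.item()
    losses.append(total_loss)
print(losses)  # the total loss should go down from epoch to epoch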
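
Once training has finished, the learned vectors live in `model.embeddings.weight`, one row per word. The following is a sketch of one way to inspect them by cosine similarity. The helper `most_similar` and the toy vocabulary are hypothetical, and the embedding used here is an untrained stand-in so that the snippet runs on its own; with the trained model you would pass `model.embeddings` and the `word_to_ix` built above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def most_similar(embedding, word_to_ix, word, k=3):
    """Return the k words whose vectors have the highest cosine similarity to `word`."""
    ix_to_word = {i: w for w, i in word_to_ix.items()}
    with torch.no_grad():
        weights = embedding.weight                          # (vocab_size, dim)
        query = weights[word_to_ix[word]].unsqueeze(0)      # (1, dim)
        sims = F.cosine_similarity(query, weights, dim=1)   # (vocab_size,)
        sims[word_to_ix[word]] = float("-inf")              # exclude the query word itself
        top = torch.topk(sims, k).indices.tolist()
    return [ix_to_word[i] for i in top]


torch.manual_seed(1)
toy_vocab = ["thy", "beauty", "brow", "field", "youth"]
toy_word_to_ix = {w: i for i, w in enumerate(toy_vocab)}
# Untrained stand-in for model.embeddings (assumption made for a runnable demo)
toy_embeddings = nn.Embedding(len(toy_vocab), 10)

neighbours = most_similar(toy_embeddings, toy_word_to_ix, "beauty", k=2)
print(neighbours)
```

With only ten epochs on a single sonnet, the neighbours are unlikely to be meaningful; the sketch only shows the mechanics of reading the embeddings back out.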

Exercise: Computing Word Embeddings: Continuous Bag-of-Words

CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
data = []
for i in range(2, len(raw_text) - 2):
    # Two words on each side of position i form the context
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))
print(data[:5])

class CBOW(nn.Module):
    def __init__(self):
        pass

    def forward(self,inputs):
        pass

def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)

make_context_vector(data[0][0], word_to_ix)  # example
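
The `CBOW` class above is left empty as the exercise. One possible way to fill it in is sketched below, assuming the common formulation in which the context embeddings are summed and passed through a single linear layer followed by a log-softmax. The constructor arguments, the short sample text, the learning rate and the epoch count are all assumptions chosen for illustration, not part of the original post.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs):
        # inputs: 1-D tensor of context word indices
        summed = self.embeddings(inputs).sum(dim=0, keepdim=True)  # (1, embedding_dim)
        return F.log_softmax(self.linear(summed), dim=1)           # (1, vocab_size)


torch.manual_seed(1)
# Short sample text so the sketch runs quickly (assumption; use raw_text in practice)
text = "we conjure the spirits of the computer with our spells".split()
word_to_ix = {w: i for i, w in enumerate(sorted(set(text)))}
data = [([text[i - 2], text[i - 1], text[i + 1], text[i + 2]], text[i])
        for i in range(2, len(text) - 2)]

model = CBOW(len(word_to_ix), embedding_dim=10)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.05)

losses = []
for epoch in range(30):
    total_loss = 0.0
    for context, target in data:
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)
        target_idx = torch.tensor([word_to_ix[target]], dtype=torch.long)
        model.zero_grad()
        log_probs = model(context_idxs)
        loss = loss_function(log_probs, target_idx)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    losses.append(total_loss)

print(losses[0], losses[-1])
```

Summing the context vectors makes the model independent of word order within the window, which is the "bag-of-words" part of the name. Averaging instead of summing is a reasonable variant.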

Original article: https://www.cnblogs.com/czhwust/p/wordembeddings.html

Time: 2024-10-08 15:26:24
