I. Accessing Text Corpora
A text corpus is a large body of text. It usually contains many individual texts, but for convenience of processing we concatenate them end to end and treat them as a single text.
1. The Gutenberg Corpus
NLTK includes a small selection of texts from the Project Gutenberg electronic text archive. To use this corpus, load the nltk package in the Python interpreter and then try nltk.corpus.gutenberg.fileids(). For example:
>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt',
 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt',
 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt',
 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']
>>>
The output shows which texts from this corpus are included in NLTK. We can work with any of them.
1) Counting words. For example:
>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> len(emma)
192427
>>>
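As a small follow-on sketch (not part of the original session), we can also count how many distinct words Emma contains and how often each one is used on average; the variable names here are just illustrative:

import nltk

emma = nltk.corpus.gutenberg.words('austen-emma.txt')

# Number of distinct word types, ignoring case.
vocab_size = len(set(w.lower() for w in emma))

# Average number of times each distinct word is used.
avg_uses = len(emma) / vocab_size

print(vocab_size, round(avg_uses, 1))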
2) Building a concordance. For example:
>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
>>> emma.concordance("surprise")
Displaying 1 of 1 matches:
that Emma could not but feel some surprise , and a little displeasure , on he
>>>
3) Accessing the raw text, words, and sentences of each file. For example:
>>> from nltk.corpus import gutenberg
>>> for fileid in gutenberg.fileids():
...     raw = gutenberg.raw(fileid)
...     num_chars = len(raw)
...     words = gutenberg.words(fileid)
...     num_words = len(words)
...     sents = gutenberg.sents(fileid)
...     num_sents = len(sents)
...     vocab = set([w.lower() for w in gutenberg.words(fileid)])
...     num_vocab = len(vocab)
...     print("%d %d %d %s" % (num_chars, num_words, num_sents, fileid))
...
887071 192427 7752 austen-emma.txt
466292 98171 3747 austen-persuasion.txt
673022 141576 4999 austen-sense.txt
4332554 1010654 30103 bible-kjv.txt
38153 8354 438 blake-poems.txt
249439 55563 2863 bryant-stories.txt
84663 18963 1054 burgess-busterbrown.txt
144395 34110 1703 carroll-alice.txt
457450 96996 4779 chesterton-ball.txt
406629 86063 3806 chesterton-brown.txt
320525 69213 3742 chesterton-thursday.txt
935158 210663 10230 edgeworth-parents.txt
1242990 260819 10059 melville-moby_dick.txt
468220 96825 1851 milton-paradise.txt
112310 25833 2163 shakespeare-caesar.txt
162881 37360 3106 shakespeare-hamlet.txt
100351 23140 1907 shakespeare-macbeth.txt
711215 154883 4250 whitman-leaves.txt
>>> raw[:1000]
"[Leaves of Grass by Walt Whitman 1855]\n\n\nCome, said my soul,\nSuch verses for
 my Body let us write, (for we are one,)\nThat should I after return,\nOr, long,
 long hence, in other spheres,\nThere to some group of mates the chants resuming,
\n(Tallying Earth's soil, trees, winds, tumultuous waves,)\nEver with pleas'd
 smile I may keep on,\nEver and ever yet the verses owning--as, first, I here and
 now\nSigning for Soul and Body, set to them my name,\n\nWalt Whitman\n\n\n\n[BOOK
 I. INSCRIPTIONS]\n\n} One's-Self I Sing\n\nOne's-self I sing, a simple separate
 person,\nYet utter the word Democratic, the word En-Masse.\n\nOf physiology
 from top to toe I sing,\nNot physiognomy alone nor brain alone is worthy for the
 Muse, I say\n the Form complete is worthier far,\nThe Female equally with the
 Male I sing.\n\nOf Life immense in passion, pulse, and power,\nCheerful, for
 freest action form'd under the laws divine,\nThe Modern Man I sing.\n\n\n\n} As
 I Ponder'd in Silence\n\nAs I ponder'd in silence,\nReturning upon my poems, c"
>>> words
['[', 'Leaves', 'of', 'Grass', 'by', 'Walt', 'Whitman', ...]
>>> sents
[['[', 'Leaves', 'of', 'Grass', 'by', 'Walt', 'Whitman', '1855', ']'],
 ['Come', ',', 'said', 'my', 'soul', ',', 'Such', 'verses', 'for', 'my', 'Body',
  'let', 'us', 'write', ',', '(', 'for', 'we', 'are', 'one', ',)', 'That',
  'should', 'I', 'after', 'return', ',', 'Or', ',', 'long', ',', 'long', 'hence',
  ',', 'in', 'other', 'spheres', ',', 'There', 'to', 'some', 'group', 'of',
  'mates', 'the', 'chants', 'resuming', ',', '(', 'Tallying', 'Earth', "'", 's',
  'soil', ',', 'trees', ',', 'winds', ',', 'tumultuous', 'waves', ',)', 'Ever',
  'with', 'pleas', "'", 'd', 'smile', 'I', 'may', 'keep', 'on', ',', 'Ever',
  'and', 'ever', 'yet', 'the', 'verses', 'owning', '--', 'as', ',', 'first', ',',
  'I', 'here', 'and', 'now', 'Signing', 'for', 'Soul', 'and', 'Body', ',', 'set',
  'to', 'them', 'my', 'name', ','], ...]
>>>
Here raw() gives the raw contents of the file (so num_chars counts characters, including whitespace), words() gives its word tokens, and sents() gives its sentences; each sentence is itself stored as a list of words. Besides words(), raw(), and sents(), most NLTK corpus readers provide a variety of other access methods.
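As a follow-up to the loop above, a minimal sketch that turns the raw counts into per-file ratios: average word length, average sentence length, and the average number of times each vocabulary item appears (the choice of scores is our own):

from nltk.corpus import gutenberg

for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    # average word length, average sentence length, average uses per vocabulary item
    print(round(num_chars / num_words), round(num_words / num_sents),
          round(num_words / num_vocab), fileid)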
2. Web and Chat Text
Project Gutenberg contains thousands of books, which are relatively formal and represent established literature. Beyond these, NLTK also includes a small collection of web text, covering content such as a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, and wine reviews. These texts can be accessed as follows:
>>> from nltk.corpus import webtext
>>> for fileid in webtext.fileids():
...     print("%s %s ..." % (fileid, webtext.raw(fileid)[:65]))
...
firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se ...
grail.txt SCENE 1: [wind] [clop clop clop]
KING ARTHUR: Whoa there! [clop ...
overheard.txt White guy: So, do you have any plans for this evening?
Asian girl ...
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr ...
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun ...
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb ...
>>>
3. The Instant Messaging Chat Corpus
This corpus was originally collected by the Naval Postgraduate School for research on the automatic detection of Internet predators. It contains over 10,000 posts, split into 15 files, where each file holds several hundred posts collected from a chat room for a particular age group on a particular date. The filename encodes the date, the chat room, and the number of posts. A usage sketch follows.
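The original post omits the example, so the following is a minimal sketch using the nltk.corpus.nps_chat reader; the fileid '10-19-20s_706posts.xml' names a file of 706 posts collected from the 20s chat room on 10/19/2006:

from nltk.corpus import nps_chat

# Each file is named <date>-<age group>_<number of posts>posts.xml.
chatroom = nps_chat.posts('10-19-20s_706posts.xml')

# Each post is stored as a list of word tokens.
print(chatroom[123])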
4. The Brown Corpus
The Brown Corpus was the first million-word electronic corpus of English. It contains text from 500 sources, categorized by genre, such as news and editorial. It is mainly used to study systematic differences between genres (a kind of linguistic inquiry known as stylistics). We can access the corpus as a list of words or as a list of sentences.
1) Reading by a particular category or file
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
 'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an',
  'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election',
  'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities',
  'took', 'place', '.'],
 ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that',
  'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all',
  'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise',
  'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the',
  'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]
>>>
2) Comparing the use of modal verbs across genres
>>> from nltk.corpus import brown
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist([w.lower() for w in news_text])
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print("%s:%d" % (m, fdist[m]))
...
can:94
could:87
may:93
might:38
must:53
will:389
>>>
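To extend this comparison to several genres at once, a sketch using nltk.ConditionalFreqDist, which counts (genre, word) pairs and can tabulate the modal counts per genre (the particular genres chosen below are just an illustration):

import nltk
from nltk.corpus import brown

# Pair every word with the genre it occurs in, then count per genre.
cfd = nltk.ConditionalFreqDist(
    (genre, word.lower())
    for genre in brown.categories()
    for word in brown.words(categories=genre)
)

genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']

# One row per genre, one column per modal verb.
cfd.tabulate(conditions=genres, samples=modals)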
5. The Reuters Corpus