CSCI446/946 - Spring Session 2019 Page 1
University of Wollongong
School of Computing and Information Technology
CSCI446/946 Big Data Analytics Spring 2019
Assignment 2 (Due: 9 October 2019, Wednesday) 20 marks
Aim
This assignment is intended to provide basic experience in conducting text analytics experiments with R. After
having completed this assignment you should know how to perform text classification, topic modeling, and sentiment
analysis.
Preliminaries
Read through the lecture notes and recommended readings on text analysis. Study all example programs therein so
that you fully understand these techniques and know how to perform them with R.
Task 1 – Text Classification (6 marks)
The 20 Newsgroups data set is a benchmark for text classification. It consists of approximately 20,000 newsgroup
documents, which have been categorised into 20 different newsgroups. Information on this dataset can be obtained
from the webpage http://qwone.com/~jason/20Newsgroups/ . Download the “20news-bydate-matlab.tgz” from
this webpage and unzip it to obtain the training and testing data sets. Train the Naïve Bayes classifier with the
training data set and test it on the testing data set.
In your report, you need to
1. Describe this 20 Newsgroups data set.
2. Describe how each document is represented in your implementation.
3. Describe the Naïve Bayes classifier and how you use it to classify the 20 Newsgroups data set.
4. Report the classification accuracy and plot the confusion matrix.
5. Attach your code at the end of the report.
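As a rough illustration of the technique named in this task, the following Python sketch implements a multinomial Naïve Bayes classifier with add-one (Laplace) smoothing on bag-of-words documents. The function names train_nb and predict_nb, the parameter alpha, and the four toy training sentences are assumptions made for this example only; the actual 20 Newsgroups word counts would have to be loaded from the downloaded files instead.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on tokenised documents (lists of words)."""
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)        # number of documents per class
    word_counts = defaultdict(Counter)  # word_counts[c][w] = count of w in class c
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)
    model = {}
    for c in class_docs:
        # Smoothed denominator: all tokens in class c plus alpha per vocabulary word.
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        model[c] = {
            "log_prior": math.log(class_docs[c] / len(docs)),
            "log_lik": {w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab},
        }
    return model


def predict_nb(model, doc):
    """Return the class with the highest log posterior; unseen words are skipped."""
    def score(c):
        params = model[c]
        return params["log_prior"] + sum(
            params["log_lik"][w] for w in doc if w in params["log_lik"]
        )
    return max(model, key=score)


# Toy stand-in data (hypothetical, NOT the real 20 Newsgroups documents).
train_docs = [
    "the team won the hockey game".split(),
    "great goal in the final game".split(),
    "the shuttle reached orbit today".split(),
    "nasa launched a new satellite into orbit".split(),
]
train_labels = ["rec.sport.hockey", "rec.sport.hockey", "sci.space", "sci.space"]
test_docs = ["hockey game tonight".split(), "satellite in orbit".split()]
test_labels = ["rec.sport.hockey", "sci.space"]

model = train_nb(train_docs, train_labels)
predictions = [predict_nb(model, d) for d in test_docs]
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
print(predictions, accuracy)
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied, which matters for documents of realistic length.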
Task 2 – Topic Modeling (6 marks)
Perform LDA topic modeling on the Reuters-21578 corpus using R (or Python). NLTK already includes
the Reuters-21578 corpus. To import this corpus, enter the following command at the Python prompt:
from nltk.corpus import reuters
R has an lda package (available from CRAN) that provides built-in functions. LDA is also implemented by several
Python libraries, such as gensim. Either use one such package/library or implement your own LDA to perform topic
modeling on the Reuters-21578 corpus.
In your report, you need to
1. Describe the Reuters-21578 corpus.
2. Describe how each document is represented in your implementation.
3. Describe the whole procedure of applying LDA to this corpus to perform topic modeling.
4. Describe the parameter settings that you use in LDA and explain their meanings.
5. Describe the output of your code and visualize the obtained topics in appropriate ways.
6. Attach your code at the end of the report.
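For readers who choose the "implement your own LDA" option, the following Python sketch shows one common way to fit LDA: collapsed Gibbs sampling. It is a sketch under stated assumptions, not a reference implementation. The function name lda_gibbs, the default values of alpha (document-topic prior), beta (topic-word prior), iterations and seed, and the four toy documents are all hypothetical choices for illustration; the Reuters-21578 documents would need to be tokenised and mapped to word ids in the same way.

```python
import random


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    vocab_size = max(w for doc in docs for w in doc) + 1
    doc_topic = [[0] * num_topics for _ in docs]                # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens of word w in topic k
    topic_total = [0] * num_topics                              # all tokens in topic k
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


# Toy stand-in corpus (hypothetical, NOT the real Reuters-21578 documents).
texts = [
    "oil crude barrel oil price".split(),
    "crude oil barrel opec".split(),
    "wheat grain harvest wheat".split(),
    "grain harvest corn wheat".split(),
]
vocab = sorted({w for t in texts for w in t})
word_id = {w: i for i, w in enumerate(vocab)}
docs = [[word_id[w] for w in t] for t in texts]

doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    top = sorted(range(len(vocab)), key=lambda w: counts[w], reverse=True)[:3]
    print("topic", k, [vocab[w] for w in top])
```

The returned count tables can be turned into topic-word and document-topic distributions by adding beta or alpha to each count and normalising each row. Because sampling is random, the topics found depend on the seed and the number of iterations.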
Task 3 – Sentiment Analysis (8 marks)
Choose a topic of your interest, such as a movie, a celebrity, or any buzz word. Then collect 200 tweets related to this
topic. Hand-tag them as positive, neutral, or negative. Next, randomly split them into 150 tweets as the training set
and the remaining 50 as the testing set. Run one or more classifiers (such as Naïve Bayes, Maximum Entropy, or
Support Vector Machines) over these tweets to perform sentiment analysis. Report the classification accuracy and
plot the confusion matrix. When you run more than one classifier, find a method to evaluate which classifier
performs better than the others. (* It is not compulsory for the students of CSCI446 to run more than one
classifier.)
In your report, you need to
1. Describe the procedure of collecting the tweets and manually tagging them.
2. Describe the statistics of the obtained data set.
3. Describe how you represent each tweet for classification.
4. For each classifier, describe its working principle, classification procedure, and parameter setting.
5. For each classifier, report the classification accuracy and plot the confusion matrix.
6. (CSCI946 only) When you run more than one classifier, report which classifier performs better than the
others and describe the methods you use to reach this conclusion.
7. Attach your code at the end of the report.
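The random 150/50 split and the evaluation step described in this task can be sketched in Python as follows. The helper names split_dataset, confusion_matrix, accuracy and macro_f1, the fixed seed, and the placeholder tweets and predictions are assumptions for illustration only. Macro-averaged F1 is shown as one possible way to compare classifiers; the assignment does not prescribe a particular comparison method.

```python
import random
from collections import Counter

LABELS = ["positive", "neutral", "negative"]


def split_dataset(items, train_size, seed=42):
    """Shuffle a copy of the items and cut it into training and testing parts."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]


def confusion_matrix(true_labels, predicted_labels, labels=LABELS):
    """Rows are true classes, columns are predicted classes, both in `labels` order."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in labels] for t in labels]


def accuracy(matrix):
    """Fraction of items on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


def macro_f1(matrix):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        fp = sum(row[i] for row in matrix) - tp  # predicted as class i but wrong
        fn = sum(matrix[i]) - tp                 # class i items that were missed
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Placeholder data (hypothetical): 200 hand-tagged tweets as (text, label) pairs.
tweets = [(f"tweet {i}", LABELS[i % 3]) for i in range(200)]
train, test = split_dataset(tweets, train_size=150)
print(len(train), len(test))

# Placeholder predictions standing in for the output of a real classifier.
true = ["positive", "positive", "neutral", "negative", "negative", "neutral"]
pred = ["positive", "neutral", "neutral", "negative", "positive", "neutral"]
matrix = confusion_matrix(true, pred)
print(matrix, accuracy(matrix), macro_f1(matrix))
```

With only 50 test tweets, small accuracy differences between classifiers may be due to chance, so repeating the split with several seeds and averaging the scores is one way to make a comparison more convincing.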
Submit:
Important:
1. The report must be in PDF format.
2. The report shall contain sufficient and detailed description, explanation, justification and
discussion. Marks will be deducted for a BRIEF report.
3. Sufficient annotation shall be provided in your code to make it easy to understand.
Neatly print your report and code (i.e. first the report then the code) on A4 pages with an appropriate cover sheet and
hand it in during the lecture on the 9th of October 2019. Make sure your report and code are correctly formatted and
titled. (Marks will be deducted for untidy or incorrectly formatted work.) Also, submit your report and the source
code in a Zipped file named A2.zip via the submit link provided in the Moodle site.
Note: Failure of your code to run may attract zero marks. Code or reports considered to be unreasonably similar due to
copying will attract zero marks. You may be requested to demonstrate and explain your program when necessary.
Marks will be awarded for correct design, implementation and style. Any request for an extension of the submission
deadline or demonstration time limit must be made to the Subject Coordinator before the submission deadline.
Supporting documentation must accompany the request for any extension. Late assignment submissions without a
granted extension will be marked but the mark awarded will be reduced by 25% of the assignment mark for each day
(including weekends) late.
