Course Project Notes: CS7650 NLP
1. Project #1 - Text Classification
- (1) Goal: Train a model to classify movie reviews as positive or negative.
- (2) Methods: three different approaches (see the perceptron sketch after this list):
- perceptron algorithm
- simple neural bag-of-words model - based on this paper
- 1-D Convolutional Neural Networks - similar to the CNN-rand baseline model
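As a concrete starting point, here is a minimal sketch of the perceptron approach on sparse bag-of-words count features. The function names (train_perceptron, predict) and the epoch count are illustrative assumptions, not the assignment's starter API.

```python
import numpy as np

def train_perceptron(X, Y, n_epochs=10):
    """Mistake-driven perceptron on bag-of-words counts.

    X: scipy csr_matrix of shape (n_reviews, vocab_size); Y: array of +1/-1 labels.
    (Names and defaults are illustrative, not the assignment's exact API.)
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        for i in range(X.shape[0]):
            x_i = X[i]                           # one review as a sparse row
            score = x_i.dot(w)[0] + b
            if Y[i] * score <= 0:                # misclassified: update weights
                w += Y[i] * x_i.toarray().ravel()
                b += Y[i]
    return w, b

def predict(w, b, X):
    """Return +1/-1 predictions for each row of X."""
    return np.where(X.dot(w) + b >= 0, 1, -1)
```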
- (3) Data: a small version of the ACL IMDB dataset. The full version is here
- data schema: under the folder ‘acllmdb_small’ there are two subfolders, ‘neg’ and ‘pos’. Each of them contains multiple .txt files, and each file holds a single review as one line of text. Two examples follow, and a small loading sketch comes after them.
- a negative example: ‘This is my first comment on a movie in here. I have to say that of all the bad films I ever seen, including Braindead for an example, this is really WORSE! I promise. Don’t even look at it. It is boring, bad acting, bad script and plot, bad effects the whole movie is one big piece of crap! If I could I would give 0 stars out of 10, but since the lowest is 1 which is awful, I need to vote that. But I would say the movie is worse than awful.
Don’t pain yourself by seeing this movie and hoping it will get better because I can tell you already now, it wont! I hoped that there might would come one single scene which would be worth watching. There didn’t came any good scene at all.
What an excellent piece of crap.
And Coolio as a vampire? LOL! LMAO! YARGH!’
- a positive example: ‘A ghost story on a typical Turkish high school with nice sound and visual effects. Taylan biraderler(taylan brothers) had made their first shots on a tv-show a couple of years ago, as far as i know. That was kind of a Turkish X-Files, they had very nice stories but lacked on visual effects. This time it seems they had what they needed and used them well. This movie will make you laugh that’s for sure, and as well it might have you scared. It has a nice plot and some young, bright actors in it. If you are a high school student in Turkey you will find so many things about you here. There are many clues in the movie about its story and ending, some you understand at the moment, some will make sense afterwords, the dialogs were written very cleverly. So these make the movie one of the best Turk movies made in the last years. Do not forget, this movie is the first of its kind in the Turkish film industry.’
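A minimal sketch of reading that folder layout into parallel lists of reviews and labels; the folder name is taken from the schema above, and load_reviews is an illustrative helper, not part of the provided code.

```python
import os

def load_reviews(root="acllmdb_small"):
    """Collect reviews and +1/-1 labels from the 'pos' and 'neg' subfolders."""
    texts, labels = [], []
    for subdir, label in (("pos", 1), ("neg", -1)):
        folder = os.path.join(root, subdir)
        for fname in sorted(os.listdir(folder)):
            if fname.endswith(".txt"):
                with open(os.path.join(folder, fname), encoding="utf-8") as f:
                    texts.append(f.read().strip())
                    labels.append(label)
    return texts, labels
```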
- (4) Data preprocessing:
- Tokenization using nltk
- Hash each word into an integer index
- Set up mappings between words and ids, plus word_counts (a short sketch of these steps follows this list)
- After preprocessing, we get the variables train, dev, and test as three ‘IMDBdata’ objects. For the input X, each object carries two representations: (1) X and (2) XwordList. (1) X is a sparse CSR matrix with one row per review and one column per vocabulary word; a printed entry such as (0, 1) 4 means that in the first review (row 0), the word with index 1 appears 4 times. (2) XwordList is more intuitive: for each review it stores the sequence of word indices, e.g. tensor([68, 85, 38, 126]). The label Y is simply an array of +1/-1 values indicating positive/negative. (A sketch of building both representations appears below.)
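A hedged sketch of the tokenization and vocabulary bookkeeping steps above, using nltk's word_tokenize; build_vocab and the returned names are illustrative, not the actual fields of the IMDBdata class.

```python
from collections import Counter
import nltk

nltk.download("punkt", quiet=True)  # tokenizer model; newer nltk may also need "punkt_tab"

def build_vocab(texts):
    """Tokenize each review and build word<->id mappings plus word counts."""
    word_counts = Counter()
    tokenized = []
    for text in texts:
        tokens = nltk.word_tokenize(text.lower())
        tokenized.append(tokens)
        word_counts.update(tokens)
    # Assign ids by decreasing frequency so frequent words get small indices.
    word_to_id = {w: i for i, (w, _) in enumerate(word_counts.most_common())}
    id_to_word = {i: w for w, i in word_to_id.items()}
    return tokenized, word_to_id, id_to_word, word_counts
```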
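And to make the two X representations concrete, here is a sketch of building both the count matrix and the per-review index lists from the tokenized text; the exact shapes and variable names in the provided IMDBdata class may differ.

```python
from collections import Counter

import numpy as np
from scipy.sparse import csr_matrix

def vectorize(tokenized, word_to_id):
    """Build (1) a review-by-vocabulary count csr_matrix and (2) index lists."""
    rows, cols, vals = [], [], []
    XwordList = []
    for i, tokens in enumerate(tokenized):
        ids = [word_to_id[t] for t in tokens if t in word_to_id]
        XwordList.append(np.array(ids))
        for j, count in Counter(ids).items():   # word id -> count in review i
            rows.append(i)
            cols.append(j)
            vals.append(count)
    X = csr_matrix((vals, (rows, cols)),
                   shape=(len(tokenized), len(word_to_id)))
    return X, XwordList
```

Printing X[0] then shows entries of the form (0, 1) 4, meaning word id 1 occurs 4 times in the first review, while XwordList[0] holds that review's word indices in order.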