Chapter 10: Finding Independent Features
The unsupervised techniques in Chapter 10 do not try to predict outcomes from the data. Instead, they try to characterize data sets that are not labeled with a specific outcome. Feature extraction is the process of finding new data rows that can be combined to reconstruct the rows of the original dataset. The book uses the cocktail party problem to illustrate feature extraction.
The cocktail party problem is the problem of understanding one person talking while many people are talking in the same room. The human brain separates the sounds and focuses on the one voice, and a computer can be programmed to do the same thing. Feature extraction can also identify recurring word-usage patterns in a set of documents, which allows the computer to determine the independent features in each document. The computer can then categorize articles into themes using the independent features extracted from the documents.
Non-Negative Matrix Factorization (yes, this is a real-world application of linear algebra) uses two matrices. The features matrix has a row for each feature and a column for every word, with values showing how important each word is to each feature. The weights matrix maps the features to the articles matrix. When the weights matrix is multiplied by the features matrix, the product is a dataset similar to the original dataset.
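To make the shapes concrete, here is a minimal sketch with made-up numbers. It assumes the weights matrix has one row per article and one column per feature, which is the orientation the products listed later in this post imply. The values and variable names are hypothetical, not taken from the book.

```python
import numpy as np

# Hypothetical toy data: 2 features described over 3 words.
# Each row is a feature, each column is a word.
features = np.array([[2.0, 0.0, 1.0],
                     [0.0, 3.0, 1.0]])

# Hypothetical weights: 4 articles, each with a weight for the 2 features.
weights = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0],
                    [0.5, 2.0]])

# Multiplying weights by features gives an articles-by-words matrix.
reconstructed = weights @ features
print(reconstructed.shape)
print(reconstructed)
```

Each row of the product is a mix of the feature rows, scaled by that article's weights. The third article, for example, is simply the sum of both features.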
Matrix operations are available in Python through the NumPy package. An unnamed algorithm uses NumPy to reconstruct the articles matrix as closely as possible by calculating the best features and weights matrices.
The algorithm uses the following matrices:
data matrix: The original articles matrix.
hn: The transposed weights matrix multiplied by the data matrix.
hd: The transposed weights matrix multiplied by the weights matrix multiplied by the features matrix.
wn: The data matrix multiplied by the transposed features matrix.
wd: The weights matrix multiplied by the features matrix multiplied by the transposed features matrix.
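The four products above can be sketched in NumPy as follows. How they are combined is not stated in this summary. The element-wise step below (multiply by the numerator, divide by the denominator) is an assumption based on the standard multiplicative update rule for non-negative matrix factorization. The function name, parameters, random starting values, and the small `eps` guard are also illustrative choices, not the book's code.

```python
import numpy as np

def factorize(data, num_features=2, iterations=200, seed=0):
    """Sketch of NMF: returns (weights, features) with weights @ features ~ data."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = data.shape

    # Start from random non-negative matrices (an assumption of this sketch).
    weights = rng.random((n_rows, num_features))
    features = rng.random((num_features, n_cols))
    eps = 1e-9  # guards against division by zero

    for _ in range(iterations):
        # hn and hd as described above; used to update the features matrix.
        hn = weights.T @ data
        hd = weights.T @ weights @ features
        features = features * hn / (hd + eps)

        # wn and wd as described above; used to update the weights matrix.
        wn = data @ features.T
        wd = weights @ features @ features.T
        weights = weights * wn / (wd + eps)

    return weights, features

# Hypothetical articles-by-words count matrix.
data = np.array([[1.0, 2.0, 3.0],
                 [2.0, 4.0, 6.0],
                 [0.0, 1.0, 1.0],
                 [3.0, 1.0, 4.0]])
w, f = factorize(data)
print(np.sum((data - w @ f) ** 2))  # squared reconstruction error
```

Because every update only multiplies and divides non-negative numbers, both matrices stay non-negative throughout, which is what gives the technique its name.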
To display the results, the computer goes through each feature and creates a list of all words and their weights, then displays the top-weighted words from that list. It then goes through all the articles and sorts them by their weight for that feature. Usually only the top articles are displayed.
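The display step described above might look something like the sketch below. The function name `top_items`, its parameters, and the toy words and titles are all hypothetical. It assumes a weights matrix with one row per article and a features matrix with one row per feature.

```python
import numpy as np

def top_items(weights, features, words, titles, num_words=2, num_articles=2):
    """For each feature, collect its top-weighted words and top-weighted articles."""
    summary = []
    for i in range(features.shape[0]):
        # Sort word indices by their importance to this feature, highest first.
        word_order = np.argsort(features[i])[::-1][:num_words]
        top_words = [words[j] for j in word_order]

        # Sort article indices by their weight for this feature, highest first.
        article_order = np.argsort(weights[:, i])[::-1][:num_articles]
        top_articles = [(titles[a], float(weights[a, i])) for a in article_order]

        summary.append((top_words, top_articles))
    return summary

# Hypothetical toy inputs.
words = ["market", "python", "stocks"]
titles = ["Article A", "Article B", "Article C"]
features = np.array([[0.1, 2.0, 0.5],
                     [1.5, 0.2, 0.9]])
weights = np.array([[0.9, 0.1],
                    [0.2, 0.8],
                    [0.5, 0.4]])

for top_words, top_articles in top_items(weights, features, words, titles):
    print(top_words)
    for title, weight in top_articles:
        print("   ", weight, title)
```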
The chapter ends with a stock market example that uses feature extraction and Non-Negative Matrix Factorization.
Thursday, April 23, 2009