Wednesday, April 29, 2009

Portfolio Assignment 9

For our final project we took a closer look at the N.E.R.O. machine learning game. N.E.R.O. stands for Neuro-Evolving Robotic Operatives. The game allows the user to train a team of robots to perform simple tasks, like going around a wall.
When training your robots for the first time, your troops just run around because they have not been trained to do anything. The first step is to put a static enemy in and reward your troops for approaching and firing upon the enemy. Next the user can put a wall between the troops and the enemy, so the robots cannot see the enemy. Depending on how your robots are rewarded, your troops should start to navigate around the wall and fire upon the enemy. After this task, you can start to train your troops to defeat a maze, a turret and even a group of moving enemies.
There were several setbacks while using N.E.R.O., such as N.E.R.O. crashing every twenty minutes of use. This was a major setback because training your robots to do certain tasks takes time. We wanted to start the training and let the troops learn overnight, but the crashes made that impossible. Another setback was the intensive calculations the program requires, which made it hard to run N.E.R.O. on my laptop for more than a few minutes at a time without the laptop heating way up.
I enjoyed using N.E.R.O. and I found it interesting to see the robots learning. After training for an hour I compared my troops to untrained troops and the difference was amazing. If we had been able to let the program run for 24 hours, converging on the brain of the fittest robot every five minutes, we would have seen the full extent of the training capabilities of N.E.R.O.

Monday, April 27, 2009

Random Thing I found on Engadget

http://www.engadget.com/2009/04/27/ibms-watson-to-rival-humans-in-round-of-jeopardy/

Thursday, April 23, 2009

Portfolio 8

Chapter 10: Finding Independent Features

The goal of the unsupervised technique used in chapter 10 is not to determine outcomes from the data, but rather to characterize data sets that are not labeled with a specific outcome. Feature extraction is the process of finding new data rows that can be used in combination to reconstruct the rows of the original dataset. The book used the cocktail party problem to describe feature extraction.
The cocktail party problem is the problem of understanding one person talking while many people are talking in the same room. The human brain separates all the sounds and focuses on the one voice. A computer can be programmed to do the same thing. Feature extraction can also identify recurring word-usage patterns in a set of documents, which allows the computer to determine the independent features in each document. Then the computer can categorize articles into themes using the independent features extracted from the documents.
Non-Negative Matrix Factorization (yes, this is a real-world application for Linear Algebra) uses a features matrix (which has a row for each feature and a column for every word, with the values showing how important each word is to each feature) and a weights matrix (which maps the features to the articles matrix). When the weights matrix is multiplied by the features matrix, the result is a dataset similar to the original dataset.
Matrix operations can be done in Python with the NumPy package. An unnamed algorithm uses NumPy to reconstruct the articles matrix as closely as possible by calculating the best features and weights matrices.

The following matrices are used:

data matrix: The original articles matrix.
hn: The transposed weights matrix multiplied by the data matrix.
hd: The transposed weights matrix multiplied by the weights matrix multiplied by the features matrix.
wn: The data matrix multiplied by the transposed features matrix.
wd: The weights matrix multiplied by the features matrix multiplied by the transposed features matrix.
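
The four matrices above can be put together into an update loop. What follows is only a minimal sketch, assuming plain NumPy arrays; the function name factorize_sketch, the random starting values, the fixed seed, the iteration count and the tiny eps added to each denominator are my own assumptions for illustration and may differ from the book's code.

```python
import numpy as np

def factorize_sketch(v, num_features=2, iterations=200, seed=0):
    """Approximate the non-negative data matrix v as w @ h."""
    rng = np.random.default_rng(seed)
    rows, cols = v.shape
    w = rng.random((rows, num_features))   # weights matrix (random start)
    h = rng.random((num_features, cols))   # features matrix (random start)
    eps = 1e-9  # assumption: keeps the divisions away from zero

    for _ in range(iterations):
        # Update the features matrix using hn and hd
        hn = w.T @ v
        hd = w.T @ w @ h
        h = h * hn / (hd + eps)

        # Update the weights matrix using wn and wd
        wn = v @ h.T
        wd = w @ h @ h.T
        w = w * wn / (wd + eps)

    return w, h

# Made-up 3x3 data matrix, only for demonstration
data = np.array([[1.0, 2.0, 3.0],
                 [2.0, 4.0, 6.0],
                 [3.0, 1.0, 2.0]])
w, h = factorize_sketch(data)

# Sum of squared differences between the data and its reconstruction
cost = np.sum((data - w @ h) ** 2)
```

Because every update only multiplies and divides non-negative numbers, the weights and features stay non-negative, which is where the technique gets its name.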

To display the result the computer must go through each of the features and create a list of all words and their weights. Then the computer should display the top weighted words from the list and then go through all the articles and sort by their weights. Usually only the top articles are displayed.
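
The display step described above could look something like this. It is a rough sketch, assuming the features and weights matrices are NumPy arrays; the function names top_words and top_articles, and all of the sample words, titles and numbers, are hypothetical.

```python
import numpy as np

def top_words(features, words, count=2):
    """For each feature (row), list its highest weighted words."""
    result = []
    for row in features:
        order = np.argsort(row)[::-1][:count]   # largest weights first
        result.append([words[i] for i in order])
    return result

def top_articles(weights, titles, feature, count=2):
    """List the articles with the largest weight for one feature."""
    order = np.argsort(weights[:, feature])[::-1][:count]
    return [titles[i] for i in order]

# Hypothetical matrices: 2 features, 3 words, 3 articles
features = np.array([[0.9, 0.1, 0.5],
                     [0.0, 0.8, 0.2]])
weights = np.array([[0.7, 0.1],
                    [0.2, 0.9],
                    [0.6, 0.3]])
words = ["python", "market", "code"]
titles = ["Article A", "Article B", "Article C"]

for number, word_list in enumerate(top_words(features, words)):
    print(word_list, top_articles(weights, titles, number))
```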

The chapter ends with a stock market example which uses feature extraction and Non-Negative Matrix Factorization.

Wednesday, April 22, 2009

Portfolio 7

I was unable to complete portfolio 7 because I was unable to get pysqlite to work on my computer or the lab computers. Pysqlite needs Python 2.5 and will not work with Python 2.6....

Chapter 4 deals with searching and ranking, for example Google's PageRank algorithm. The first step is crawling, which means starting with a small set of documents and following the links in those documents to find new documents. After a large set of documents has been found, the documents are indexed into a table that records each document and the locations of its words. The last step is returning a ranked list of documents.
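
To make the indexing step concrete, here is a small sketch of an index that records where each word appears. It assumes an in-memory dictionary in place of a database table, and the function name build_index, the document names and the document text are all made up for illustration.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each word to a list of (document name, word position) pairs."""
    index = defaultdict(list)
    for name, text in documents.items():
        # Split the text into lowercase words
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            index[word].append((name, position))
    return index

# Hypothetical documents standing in for crawled pages
docs = {
    "a.html": "Python makes data mining fun",
    "b.html": "Data mining with Python and more Python",
}
index = build_index(docs)
print(index["python"])
```

Looking up a query word in the index gives every document that contains it, along with the positions needed for ranking.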

Ranking queries can be done by creating a neural network which will associate searches with results based on what links people click on after they get a list of results. The neural network will use the new queries to alter the ranking of documents.

Content-Based Ranking gives a score to each page for each query. This is done by using word frequency (the number of times the words from a given query appear in the document), document location (the closer to the beginning a word is, the higher the score) and word distance (if multiple words are used in a query, then the closer those words are together in a document, the higher the score).
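
The three measures can be sketched for a single document like this. This is a simplified illustration, assuming a two-word query where every query word appears in the document; the function names and the sample sentence are hypothetical. Note that these are raw values: for location and distance a smaller number is better, so a real ranker would still need to turn them into scores where higher is better.

```python
def frequency_score(words, query):
    """Count how many times the query words occur in the document."""
    return sum(words.count(q) for q in query)

def location_score(words, query):
    """Add up the first position of each query word (lower is better)."""
    return sum(words.index(q) for q in query)

def distance_score(words, query):
    """Smallest gap between the two query words (lower is better)."""
    first, second = query
    pos1 = [i for i, w in enumerate(words) if w == first]
    pos2 = [i for i, w in enumerate(words) if w == second]
    return min(abs(i - j) for i in pos1 for j in pos2)

# Made-up document and query
doc = "the data mining course covers data and text mining".split()
query = ["data", "mining"]

print(frequency_score(doc, query))  # 4
print(location_score(doc, query))   # 3
print(distance_score(doc, query))   # 1
```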

Many search engines also rely upon the number of times a link is clicked. Google's famous PageRank algorithm adds another signal on top of content-based ranking: it scores each page by the number and importance of the pages that link to it.

Wednesday, February 25, 2009

Portfolio Assignment 5

The visualization I found was the 200 Best Video Games of All Time. 

The data set can be found here: http://manyeyes.alphaworks.ibm.com/manyeyes/datasets/the-200-best-reviewed-computer-and-v/versions/1

My favorite visualization is this one, except the label is changed from platform to company.
This visualization shows the average score of video games for each platform. It is useful for showing which companies are rated the highest on each video game platform. For example, EA Sports has a much higher average score on the PS2 than on the Game Cube and the XBox, so players would be more likely to buy an EA Sports game for the PS2 than for the Game Cube or the XBox.


This is a pie chart showing how many games each platform has and the percentage of the market each platform holds. While the data is slightly outdated, the pie chart's function is to show the number of games on each platform compared to the other platforms. For example, the pie chart shows the massive library of PS2 video games compared to the Game Cube library.


My Dataset to be used for my visualization.

http://manyeyes.alphaworks.ibm.com/manyeyes/visualize/the-value-of-the-minimum-wage-1947-2-3

Saturday, February 21, 2009

Portfolio 4

Part 1:

I downloaded the provided code and ran generatefeedvector.py, which took several minutes to finish. I received this output:

>>> 
Failed to parse feed http://radar.oreilly.com/index.rdf


Failed to parse feed http://www.techeblog.com/index.php/feed/


The Superficial - Because You're Ugly
Wonkette
Publishing 2.0
Eschaton
Mashable!
we make money not art
Joho the Blog
Neil Gaiman's Journal
Signal vs. Noise
Online Marketing Report
Kotaku
ReadWriteWeb
Deadspin
John Battelle's Searchblog
How to Change the World
43 Folders
Daily Kos
Stepcase Lifehack
Power Line
Giga Omni Media, Inc.
GoFugYourself
Google Operating System
Gawker: Valleywag
Gizmodo
ScienceBlogs : Combined Feed
Michelle Malkin
Lifehacker
SimpleBits
Slashdot
Gothamist
Instapundit
BuzzMachine
Sifry's Alerts
Topix.net Weblog
The Viral Garden
Micro Persuasion
Cool Hunting
flagrantdisregard
Search Engine Watch Blog
Joystiq
Boing Boing
Download Squad
Captain's Quarters
MAKE Magazine
Engadget
blog maverick
Techdirt
The Blotter
Crooks and Liars
TMZ.com
Bloggers Blog: Blogging the Blogsphere
Schneier on Security
Search Engine Roundtable
Copyblogger
Think Progress
Little Green Footballs
SpikedHumor - Today's Videos and Pictures
456 Berea Street
The Full Feed from HuffingtonPost.com
Pharyngula
Creating Passionate Users
TechCrunch
PaulStamatiou.com
TreeHugger
Steve Pavlina's Personal Development Blog
NewsBusters.org - Exposing Liberal Media Bias
kottke.org
MetaFilter
ongoing
Oilman
The Daily Dish | By Andrew Sullivan
Joi Ito's Web
A Consuming Experience (full feed)
mezzoblue
Matt Cutts: Gadgets, Google, and SEO
The Unofficial Apple Weblog (TUAW)
Wired Top Stories
The Official Google Blog
Joel on Software
Scobleizer -- Tech geek blogger
Bloglines | News
Quick Online Tips
Derek Powazek
Jeremy Zawodny's blog
WIL WHEATON dot NET: in exile
gapingvoid: "cartoons drawn on the back of business cards"
Shoemoney - Skills To Pay The Bills
Autoblog
Google Blogoscoped
plasticbag.org
Gawker
Celebrity gossip juicy celebrity rumors Hollywood gossip blog from Perez Hilton
Talking Points Memo
ProBlogger Blog Tips
Seth's Blog
>>>

I am having trouble installing the Python Imaging Library on my computer. Xcode was deleted when I reinstalled Mac OS X on top of the same operating system after my system library became corrupted during a maintenance operation, so there is no gcc to compile with. I plan on wiping my computer over spring break, but until then I am unable to install the Python Imaging Library. Here is what I did:

30:Desktop thejimshaw$ cd Imaging-1.1.6
230:Imaging-1.1.6 thejimshaw$ ls
BUILDME PIL _imaging.c doctest.py selftest.py
CHANGES PIL.pth _imagingft.c encode.c setup.py
CONTENTS README _imagingmath.c libImaging
Docs Sane _imagingtk.c map.c
Images Scripts decode.c outline.c
MANIFEST Tk display.c path.c
230:Imaging-1.1.6 thejimshaw$ python setup.py install
running install
running build
running build_py
creating build
creating build/lib.macosx-10.3-i386-2.5
copying PIL/__init__.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/ArgImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/BdfFontFile.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/BmpImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/BufrStubImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/ContainerIO.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/CurImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/DcxImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/EpsImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/ExifTags.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/FitsStubImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/FliImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/FontFile.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/FpxImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/GbrImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/GdImageFile.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/GifImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/GimpGradientFile.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/GimpPaletteFile.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/GribStubImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/Hdf5StubImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/IcnsImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/IcoImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/Image.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/ImageChops.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/ImageColor.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/ImageDraw.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/ImageDraw2.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/ImageEnhance.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/ImageFile.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/ImageFileIO.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/ImageFilter.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/ImageFont.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/ImageGL.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/ImageGrab.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/ImageMath.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/ImageMode.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/ImageOps.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/ImagePalette.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/ImagePath.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/ImageQt.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/ImageSequence.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/ImageStat.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/ImageTk.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/ImageTransform.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/ImageWin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/ImImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/ImtImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/IptcImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/JpegImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/McIdasImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/MicImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/MpegImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/MspImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/OleFileIO.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/PaletteFile.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/PalmImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/PcdImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/PcfFontFile.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/PcxImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/PdfImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/PixarImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/PngImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/PpmImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/PsdImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/PSDraw.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/SgiImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/SpiderImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/SunImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/TarIO.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/TgaImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/TiffImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/TiffTags.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/WalImageFile.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/WmfImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/XbmImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/XpmImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
copying PIL/XVThumbImagePlugin.py -> build/lib.macosx-10.3-i386-2.5
running build_ext
--- using frameworks at /System/Library/Frameworks
building '_imaging' extension
creating build/temp.macosx-10.3-i386-2.5
creating build/temp.macosx-10.3-i386-2.5/libImaging
Compiling with an SDK that doesn't seem to exist: /Developer/SDKs/MacOSX10.4u.sdk
Please check your Xcode installation
gcc -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -fno-strict-aliasing -Wno-long-double -no-cpp-precomp -mno-fused-madd -fno-common -dynamic -DNDEBUG -g -O3 -DHAVE_LIBZ -IlibImaging -I/Library/Frameworks/Python.framework/Versions/2.5/include -I/Library/Frameworks/Python.framework/Versions/2.5/include/python2.5 -c _imaging.c -o build/temp.macosx-10.3-i386-2.5/_imaging.o
unable to execute gcc: No such file or directory
error: command 'gcc' failed with exit status 1
230:Imaging-1.1.6 thejimshaw$ 

Part 2
I am having trouble with my data sets and since I am unable to install Python Imaging Library I was unable to complete this section. 

Part 3

I explored Many Eyes (http://manyeyes.alphaworks.ibm.com/manyeyes/), which is a visualization tool for data sets. I found some data sets which worked well, such as the Movie Genres 1888 to 2012 Count ( http://manyeyes.alphaworks.ibm.com/manyeyes/visualizations/movie-genres-1888-2012-count) and the Words Appearing in Titles of Film Noir Movies on IMDB.
(http://manyeyes.alphaworks.ibm.com/manyeyes/visualizations/words-appearing-in-titles-of-film-no)

I also discovered some data sets which did not work well, such as Movies By Genre.
(http://manyeyes.alphaworks.ibm.com/manyeyes/visualizations/new/wordle/movies-by-genre-20)

I enjoyed how easy it is to download the data sets, the ability to see the same data in several different ways and how users are able to share their visualizations with other users.

Thursday, February 12, 2009

Last.FM

My group (Dylan Jesse and I) used C# to create our music recommendation front end. We used the Last.FM API with these functions:

Artist Search: getRecommended
Artist: getPage
Album Search: search
Album: getPage
Artist Bio: getUrl
Album Wiki: getUrl

Our program automatically logs into Last.FM, then lets the user select a search or recommendation and type in what they are searching for. The program then returns all possible matches, and the user can click on a result to open a web object showing the web page for the item they clicked on. Enjoy!

Saturday, January 31, 2009

Portfolio 2

Exercise 1: Tanimoto Score

Here is my Tanimoto score function:

# Returns the Tanimoto score for person1 and person2
def tanimoto_score(prefs,person1,person2):
  # Get the list of shared_items
  si={}
  for item in prefs[person1]: 
    if item in prefs[person2]: si[item]=1

  # if they have no ratings in common, return 0
  if len(si)==0: return 0

  # For each shared item, compute a*b / (a^2 + b^2 - a*b) and add up the results
  tanimoto=sum([prefs[person1][item]*prefs[person2][item] /
                (pow(prefs[person1][item],2)+pow(prefs[person2][item],2)-prefs[person1][item]*prefs[person2][item])
                for item in si])
  return 1/(1+tanimoto)

When I ran the function I got:
>>> recommendations.tanimoto_score(recommendations.critics,'Lisa Rose', 'Gene Seymour')
0.15581371067992209

To calculate the Tanimoto Score I used:

T(A, B) = (A . B) / (|A|^2 + |B|^2 - A . B)

From: http://en.wikipedia.org/wiki/Jaccard_index#Tanimoto_coefficient_.28extended_Jaccard_coefficient.29

The Tanimoto Score is often used to find similarity between two documents.
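
For comparison, the coefficient in the Wikipedia article is computed once over the whole rating vectors rather than item by item. The following is a minimal sketch of that version, assuming the same dictionary-of-dictionaries layout as prefs; the function name tanimoto_coefficient and the sample ratings are made up for illustration. It is not the function used for the result above, so it gives different numbers.

```python
def tanimoto_coefficient(prefs, person1, person2):
    """Tanimoto coefficient over the items both people rated."""
    shared = [item for item in prefs[person1] if item in prefs[person2]]

    # If they have no ratings in common, return 0
    if not shared:
        return 0.0

    # Dot product and squared lengths of the two rating vectors
    dot = sum(prefs[person1][i] * prefs[person2][i] for i in shared)
    norm1 = sum(prefs[person1][i] ** 2 for i in shared)
    norm2 = sum(prefs[person2][i] ** 2 for i in shared)

    # Assumes ratings are positive, so the denominator is never zero
    return dot / (norm1 + norm2 - dot)

# Hypothetical ratings, not taken from the book's critics data
sample = {
    "Ann": {"Film A": 4.0, "Film B": 2.0, "Film C": 5.0},
    "Bob": {"Film A": 4.0, "Film B": 2.0},
    "Cat": {"Film A": 2.0, "Film B": 4.0},
}

print(tanimoto_coefficient(sample, "Ann", "Bob"))  # identical shared ratings give 1.0
print(tanimoto_coefficient(sample, "Ann", "Cat"))
```

With this version a score of 1.0 means the shared ratings are identical, and lower scores mean the ratings are less alike.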

Part 1: Weka

I found the Weka tool to be very easy to use. The program gives us a quick and easy way of analyzing the raw data from an ARFF file. I received the same answers as we were given.

Part 2: Cleveland Heart Disease Dataset

I downloaded the dataset in the ARFF file format, which made this part very quick to do. When I classified the data I found,

Correctly Classified Instances: 77.558%
Incorrectly Classified Instances: 22.4422%
Total Instances: 303

I believe the J48 classifier is a fairly accurate method of analyzing data; however, I believe running several types of classifiers would help us make more accurate predictions. When I ran the Random Forest classifier I found,

Correctly Classified Instances: 81.5182%
Incorrectly Classified Instances: 18.4818%
Total Instances: 303

The Random Forest classifier was more accurate. When I ran the Decision Stump classifier, it was less accurate than the J48 classifier.

Correctly Classified Instances: 71.6172%
Incorrectly Classified Instances: 28.3828%
Total Instances: 303

After the comparisons, I came to the conclusion that multiple classifiers should be run when analyzing datasets in order to get the best results possible.

I found that the majority of the people in the dataset were male (Total: 207) while women were the minority (Total: 96). The number of people with fbs was 258, while 48 did not have it. The median age was 55.366 and thal was the first set of data on the tree.

Saturday, January 24, 2009

Assignment 1

My first Python function was the sim_distance method:

#returns a distance-based similarity score for person1 and person2
def sim_distance(prefs,person1,person2):
    # Get the list of shared_items
    si={}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item]=1

    # if they have no ratings in common, return 0
    if len(si)==0:return 0

    # Add up the squares of all the differences
    sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2) for item in si])
    return 1/(1+(sum_of_squares))

I found the error on page 11 and changed 

return 1/(1+sqrt(sum_of_squares))

to

return 1/(1+(sum_of_squares))

>>> recommendations.sim_distance(recommendations.critics,'Lisa Rose','Gene Seymour')
0.14814814814814814
_________________________________________________________________
Here is my Manhattan Distance Function:

#returns a distance-based similarity score for person1 and person2
def man_distance(prefs,person1,person2):
    # Get the list of shared_items
    si={}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item]=1

    # if they have no ratings in common, return 0
    if len(si)==0:return 0

    # Collect the absolute differences for the shared items
    ManDistance = [ abs(prefs[person1][item] - prefs[person2][item]) for item in si ]

    # Add them up and turn the distance into a similarity score
    return (1/(1+sum(ManDistance)))

>>> recommendations.man_distance(recommendations.critics,'Lisa Rose','Gene Seymour')
0.18181818181818182
_________________________________________________________________

I found the Collective Intelligence book to be a good way of learning how to program in Python, assuming one has previous programming experience. I liked how the book starts off with explaining the importance of collective intelligence along with real-world examples. Next the book gives a basic example of a recommendation system to program in Python and the code starts to increase in complexity.

I like how the chapters are not too verbose and are easy to understand. I was slightly disappointed that the book did not give the Manhattan Distance formula, but rather gave a link to a Wikipedia article. However, I am enjoying using this book and I hope the book continues to be good.

Wednesday, January 21, 2009

Hello Blog World!

I will be adding posts with regards to Data Mining (CPSC 470u at UMW, taught by Dr. Zacharski (http://www.zacharski.org/)).