Saturday, January 31, 2009

Portfolio 2

Exercise 1: Tanimoto Score

Here is my Tanimoto score function:

# Returns the Tanimoto score for person1 and person2
def tanimoto_score(prefs,person1,person2):
  # Get the list of shared_items
  si={}
  for item in prefs[person1]: 
    if item in prefs[person2]: si[item]=1

  # if they have no ratings in common, return 0
  if len(si)==0: return 0

  # Sum the per-item Tanimoto ratios over the shared items
  tanimoto = sum([prefs[person1][item]*prefs[person2][item] /
                  (pow(prefs[person1][item],2)+pow(prefs[person2][item],2)
                   -prefs[person1][item]*prefs[person2][item])
                  for item in si])
  return 1/(1+tanimoto)

When I ran the function I got:
>>> recommendations.tanimoto_score(recommendations.critics,'Lisa Rose', 'Gene Seymour')
0.15581371067992209

To calculate the Tanimoto Score I used:

T(A,B) = A·B / (|A|^2 + |B|^2 - A·B)

From: http://en.wikipedia.org/wiki/Jaccard_index#Tanimoto_coefficient_.28extended_Jaccard_coefficient.29

The Tanimoto Score is often used to find similarity between two documents.
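
For comparison, here is a minimal sketch of the coefficient computed directly on the two people's shared ratings as vectors (the tanimoto_coefficient name and structure are my own, not from the book). Note that it evaluates the formula above once over the whole vectors, while my tanimoto_score sums a per-item ratio and wraps it in 1/(1+x), so the two functions return different values.

# Computes the extended Jaccard (Tanimoto) coefficient on the
# shared-rating vectors: T(A,B) = A·B / (|A|^2 + |B|^2 - A·B)
def tanimoto_coefficient(prefs,person1,person2):
  # collect the items both people have rated
  shared=[item for item in prefs[person1] if item in prefs[person2]]
  if len(shared)==0: return 0

  a=[prefs[person1][item] for item in shared]
  b=[prefs[person2][item] for item in shared]
  dot=sum([x*y for x,y in zip(a,b)])
  return dot/(sum([x*x for x in a])+sum([y*y for y in b])-dot)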

Part 1: Weka

I found the Weka tool to be very easy to use. The program gives us a quick and easy way of analyzing the raw data from an ARFF file, and I got the same answers we were given.

Part 2: Cleveland Heart Disease Dataset

I downloaded the dataset in the ARFF file format, which made this part very quick to do. When I classified the data with the J48 classifier, I found:

Correctly Classified Instances: 77.558%
Incorrectly Classified Instances: 22.4422%
Total Instances: 303

I believe the J48 classifier is a fairly accurate way of analyzing this data; however, running several types of classifiers would help us make more accurate predictions. When I ran the Random Forest classifier I found:

Correctly Classified Instances: 81.5182%
Incorrectly Classified Instances: 18.4818%
Total Instances: 303

The Random Forest classifier yielded more accurate results. The Decision Stump classifier, on the other hand, yielded less accurate results than J48:

Correctly Classified Instances: 71.6172%
Incorrectly Classified Instances: 28.3828%
Total Instances: 303

After the comparisons, I came to the conclusion that multiple classifiers should be run when analyzing datasets in order to give the best results possible.

I found that the majority of the people in the dataset were male (total: 207) while women were the minority (total: 96). The number of people with fbs was 258, while 48 did not have it. The median age was 55.366, and thal was the attribute at the root of the tree.

Saturday, January 24, 2009

Assignment 1

My first Python function was the sim_distance method:

#returns a distance-based similarity score for person1 and person2
def sim_distance(prefs,person1,person2):
    # Get the list of shared_items
    si={}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item]=1

    # if they have no ratings in common, return 0
    if len(si)==0:return 0

    # Add up the squares of all the differences
    sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2) for item in si])
    return 1/(1+(sum_of_squares))

I found the error on page 11 and changed 

return 1/(1+sqrt(sum_of_squares))

to

return 1/(1+(sum_of_squares))

>>> recommendations.sim_distance(recommendations.critics,'Lisa Rose','Gene Seymour')
0.14814814814814814
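
As a sanity check, I redid the arithmetic by hand with the two critics' six shared ratings (values from the book's critics dictionary):

>>> rose=[2.5,3.5,3.0,3.5,2.5,3.0]
>>> seymour=[3.0,3.5,1.5,5.0,3.5,3.0]
>>> sum([pow(a-b,2) for a,b in zip(rose,seymour)])
5.75
>>> 1/(1+5.75)
0.14814814814814814

which matches the output of sim_distance.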
_________________________________________________________________
Here is my Manhattan Distance Function:

#returns a distance-based similarity score for person1 and person2
def man_distance(prefs,person1,person2):
    # Get the list of shared_items
    si={}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item]=1

    # if they have no ratings in common, return 0
    if len(si)==0:return 0

    # Add up the absolute values of all the differences
    ManDistance = [ abs(prefs[person1][item] - prefs[person2][item]) for item in si ]

    return (1/(1+sum(ManDistance)))

>>> recommendations.man_distance(recommendations.critics,'Lisa Rose','Gene Seymour')
0.18181818181818182
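
For reference, the Manhattan (taxicab) distance is simply the sum of the absolute differences, d(A,B) = sum(|a_i - b_i|). Checking by hand with the same two critics' shared ratings as before:

>>> rose=[2.5,3.5,3.0,3.5,2.5,3.0]
>>> seymour=[3.0,3.5,1.5,5.0,3.5,3.0]
>>> sum([abs(a-b) for a,b in zip(rose,seymour)])
4.5
>>> 1/(1+4.5)
0.18181818181818182

which matches man_distance.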
_________________________________________________________________

I found the Collective Intelligence book to be a good way of learning how to program in Python, assuming one has previous programming experience. I liked how the book starts off by explaining the importance of collective intelligence along with real-world examples. Next, the book gives a basic recommendation system to program in Python, and from there the code increases in complexity.

I like how the chapters are not too verbose and are easy to understand. I was slightly disappointed that the book did not give the Manhattan Distance formula, instead linking to a Wikipedia article. However, I am enjoying this book and I hope it continues to be good.

Wednesday, January 21, 2009

Hello Blog World!

I will be adding posts about Data Mining (CPSC 470u at UMW, taught by Dr. Zacharski, http://www.zacharski.org/).