Saturday, January 31, 2009

Portfolio 2

Exercise 1: Tanimoto Score

Here is my Tanimoto score function:

# Returns the Tanimoto score for person1 and person2
def tanimoto_score(prefs,person1,person2):
  # Get the list of shared_items
  si={}
  for item in prefs[person1]: 
    if item in prefs[person2]: si[item]=1

  # if they have no ratings in common, return 0
  if len(si)==0: return 0

  # Add up the per-item ratios a*b / (a^2 + b^2 - a*b) over the shared items
  tanimoto=0.0
  for item in prefs[person1]:
    if item in prefs[person2]:
      a=prefs[person1][item]
      b=prefs[person2][item]
      tanimoto+=a*b/(pow(a,2)+pow(b,2)-a*b)
  # Note: a larger sum gives a smaller returned score
  return 1/(1+tanimoto)

When I ran the function I got:
>>> recommendations.tanimoto_score(recommendations.critics,'Lisa Rose', 'Gene Seymour')
0.15581371067992209

To calculate the Tanimoto Score I used:

T(A, B) = (A · B) / (|A|^2 + |B|^2 - A · B)

From: http://en.wikipedia.org/wiki/Jaccard_index#Tanimoto_coefficient_.28extended_Jaccard_coefficient.29

The Tanimoto Score is often used to find similarity between two documents.
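
For comparison, the Wikipedia formula treats each person's ratings as one vector and computes a single ratio, rather than adding up one ratio per item as the function above does. Here is a minimal sketch of that vector form. It assumes prefs is a dict of dicts like recommendations.critics and that only the items both people rated are counted. The name tanimoto_coefficient and the sample ratings are hypothetical, made up for illustration.

```python
# Sketch of the vector form of the Tanimoto coefficient:
#   T(A, B) = A.B / (|A|^2 + |B|^2 - A.B)
# Assumption: only items rated by both people are included in the vectors.
# The name tanimoto_coefficient and the sample data are hypothetical.
def tanimoto_coefficient(prefs, person1, person2):
    # Items rated by both people
    shared = [item for item in prefs[person1] if item in prefs[person2]]
    if not shared:
        return 0.0

    # Dot product and squared lengths of the two rating vectors
    dot = sum(prefs[person1][i] * prefs[person2][i] for i in shared)
    norm1 = sum(prefs[person1][i] ** 2 for i in shared)
    norm2 = sum(prefs[person2][i] ** 2 for i in shared)

    denom = norm1 + norm2 - dot
    if denom == 0:
        # Only happens when every shared rating is zero
        return 0.0
    return float(dot) / denom


sample = {
    'A': {'x': 1.0, 'y': 2.0},
    'B': {'x': 1.0, 'y': 2.0, 'z': 5.0},
    'C': {'x': 2.0, 'y': 1.0},
}
print(tanimoto_coefficient(sample, 'A', 'B'))  # identical shared ratings
print(tanimoto_coefficient(sample, 'A', 'C'))  # same items, different ratings
```

With this form, identical shared ratings score 1 and the score drops as the ratings diverge, so a higher value means more similar.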

Part 1: Weka

I found the Weka tool to be very easy to use. The program gives us a quick and easy way of analyzing the raw data from an ARFF file. I received the same answers as we were given.

Part 2: Cleveland Heart Disease Dataset

I downloaded the dataset in the ARFF file format, which made this part very quick to do. When I classified the data with the J48 classifier, I found:

Correctly Classified Instances: 77.558%
Incorrectly Classified Instances: 22.4422%
Total Instances: 303

I believe the J48 classifier is a fairly accurate method of analyzing data. However, I believe running several types of classifiers would help us make more accurate predictions. When I ran the Random Forest classifier, I found:

Correctly Classified Instances: 81.5182%
Incorrectly Classified Instances: 18.4818%
Total Instances: 303

The Random Forest classifier was more accurate than the J48 classifier. When I ran the Decision Stump classifier, it was less accurate than the J48 classifier:

Correctly Classified Instances: 71.6172%
Incorrectly Classified Instances: 28.3828%
Total Instances: 303

After the comparisons, I came to the conclusion that multiple classifiers should be run when analyzing datasets in order to give the best results possible.
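
The comparison can be sketched in a few lines of Python. This only re-does the arithmetic on the percentages reported above (it does not call Weka), converting each one back to an approximate count of correctly classified instances out of 303 and ranking the classifiers. The variable names are hypothetical.

```python
# Accuracy (percent correctly classified) as reported above, on 303 instances.
total = 303
accuracy = {
    'J48': 77.558,
    'Random Forest': 81.5182,
    'Decision Stump': 71.6172,
}

# Convert each percentage back to an approximate count of correct instances.
correct = dict(
    (name, int(round(pct / 100.0 * total))) for name, pct in accuracy.items()
)

# Rank the classifiers from most to least accurate.
ranking = sorted(accuracy, key=accuracy.get, reverse=True)
for name in ranking:
    print('%s: %d of %d correct' % (name, correct[name], total))
```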

I found that the majority of the people in the dataset were male (Total: 207), while women were the minority (Total: 96). The number of people with fbs was 258, while 48 did not have it. The median age was 55.366, and thal was the first attribute at the top of the tree.
