Exercise 1: Tanimoto Score
Here is my Tanimoto score function:
# Returns the Tanimoto score for person1 and person2
def tanimoto_score(prefs,person1,person2):
    # Get the list of shared_items
    si={}
    for item in prefs[person1]:
        if item in prefs[person2]: si[item]=1
    # if they have no ratings in common, return 0
    if len(si)==0: return 0
    # For each shared item, compute a*b / (a^2 + b^2 - a*b) and add up the results
    tanimoto=sum([prefs[person1][item]*prefs[person2][item] / (pow(prefs[person1][item],2)+pow(prefs[person2][item],2)-prefs[person1][item]*prefs[person2][item]) for item in prefs[person1] if item in prefs[person2]])
    return 1/(1+tanimoto)
When I ran the function I got:
>>> recommendations.tanimoto_score(recommendations.critics,'Lisa Rose', 'Gene Seymour')
0.15581371067992209
To calculate the Tanimoto score I used the formula:
T(A, B) = (A . B) / (|A|^2 + |B|^2 - A . B)
From: http://en.wikipedia.org/wiki/Jaccard_index#Tanimoto_coefficient_.28extended_Jaccard_coefficient.29
The Tanimoto Score is often used to find similarity between two documents.
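For comparison, the Wikipedia formula can also be applied once to the whole vectors of shared ratings, rather than item by item as my function does. The following is only a minimal sketch of that reading. The function name tanimoto_coefficient and the small ratings dictionary are made up for illustration and are not part of recommendations.py.

```python
def tanimoto_coefficient(prefs, person1, person2):
    """Tanimoto (extended Jaccard) coefficient over the items both people rated."""
    # Only items rated by both people contribute to the vectors
    shared = [item for item in prefs[person1] if item in prefs[person2]]
    if not shared:
        return 0.0
    # Dot product A . B and squared norms |A|^2 and |B|^2
    dot = sum(prefs[person1][i] * prefs[person2][i] for i in shared)
    norm1 = sum(prefs[person1][i] ** 2 for i in shared)
    norm2 = sum(prefs[person2][i] ** 2 for i in shared)
    denom = norm1 + norm2 - dot
    # The denominator is zero only when every shared rating is zero
    if denom == 0:
        return 0.0
    return dot / denom


# Hypothetical ratings, invented for this example only
sample = {
    'A': {'x': 1.0, 'y': 2.0, 'z': 3.0},
    'B': {'x': 1.0, 'y': 2.0, 'z': 3.0},
    'C': {'x': 3.0, 'y': 1.0},
    'D': {'w': 4.0},
}

print(tanimoto_coefficient(sample, 'A', 'B'))  # identical ratings
print(tanimoto_coefficient(sample, 'A', 'C'))  # partial overlap
print(tanimoto_coefficient(sample, 'A', 'D'))  # nothing in common
```

With this form, identical ratings score 1.0 and people with nothing in common score 0, so a higher value means more similar. My function above wraps the sum in 1/(1+sum), so its values are not directly comparable to these.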
Part 1: Weka
Part 2: Cleveland Heart Disease Dataset
I downloaded the dataset in the ARFF file format, which made this part very quick to do. When I classified the data with the J48 classifier I found:
Correctly Classified Instances: 77.558%
Incorrectly Classified Instances: 22.4422%
Total Instances: 303
I believe the J48 classifier is a fairly accurate method of analyzing data; however, I believe running several types of classifiers would help us make more accurate predictions. When I ran the Random Forest classifier I found:
Correctly Classified Instances: 81.5182%
Incorrectly Classified Instances: 18.4818%
Total Instances: 303
The Random Forest classifier was more accurate than J48. When I ran the Decision Stump classifier, it was less accurate than the J48 classifier:
Correctly Classified Instances: 71.6172%
Incorrectly Classified Instances: 28.3828%
Total Instances: 303
After the comparisons, I came to the conclusion that multiple classifiers should be run when analyzing datasets in order to get the best results possible.
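The three results above can be put on the same footing by converting each percentage back into a count of instances. This is just a small arithmetic sketch. The percentages and the total are the ones reported above, while the names percent_correct and correct_count are made up for illustration and have nothing to do with Weka's own output.

```python
TOTAL = 303  # total instances reported for every run

# Percent correctly classified, copied from the results above
percent_correct = {
    'J48': 77.558,
    'Random Forest': 81.5182,
    'Decision Stump': 71.6172,
}


def correct_count(percent, total=TOTAL):
    """Convert a percentage of the dataset back into a whole number of instances."""
    return int(round(percent / 100.0 * total))


# List the classifiers from most to least accurate
ranked = sorted(percent_correct.items(), key=lambda kv: kv[1], reverse=True)
for name, pct in ranked:
    n = correct_count(pct)
    print('%-15s %3d correct, %3d incorrect' % (name, n, TOTAL - n))
```

Worked out this way, Random Forest classified 247 of the 303 instances correctly, J48 235, and Decision Stump 217, so the gap between the best and worst classifier is 30 instances.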
I found that the majority of the people in the dataset were male (Total: 207) while women were the minority (Total: 96). The number of people with fbs was 258, while 48 did not have it. The median age was 55.366, and thal was the first attribute split on in the tree.