A. Premade Estimators
train_y = train.pop('Species')
test_y = test.pop('Species')
# Build one numeric feature column per Iris feature.
my_feature_columns = [tf.feature_column.numeric_column(key=key) for key in train.keys()]

# The three premade estimators compared below (hidden layer sizes are illustrative):
classifier = tf.estimator.DNNClassifier(
    feature_columns=my_feature_columns, hidden_units=[30, 10], n_classes=3)
classifier = tf.estimator.DNNLinearCombinedClassifier(
    dnn_feature_columns=my_feature_columns, dnn_hidden_units=[30, 10], n_classes=3)
classifier = tf.estimator.LinearClassifier(
    feature_columns=my_feature_columns, n_classes=3)
| Estimator | Test Accuracy | Test Loss |
|---|---|---|
| DNN Classifier | 0.70 | 0.56 |
| DNN Linear Combined Classifier | 0.73 | 0.56 |
| Linear Classifier | 0.97 | 0.07 |
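The accuracy and average loss reported in the table are straightforward to compute from per-example predictions. A minimal sketch with hypothetical class probabilities (not the actual Iris outputs):

```python
import math

# Hypothetical per-example predictions: (class probabilities, true class index).
predictions = [
    ([0.80, 0.15, 0.05], 0),
    ([0.10, 0.70, 0.20], 1),
    ([0.25, 0.35, 0.40], 1),   # misclassified: argmax is class 2
    ([0.05, 0.10, 0.85], 2),
]

# Accuracy: fraction of examples whose highest-probability class matches the label.
correct = sum(1 for probs, label in predictions
              if probs.index(max(probs)) == label)
accuracy = correct / len(predictions)

# Average loss: mean cross-entropy, i.e. -log(probability of the true class).
avg_loss = sum(-math.log(probs[label]) for probs, label in predictions) / len(predictions)

print(round(accuracy, 2))   # 0.75
print(round(avg_loss, 3))   # 0.448
```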
B. Build a Linear Model
A categorical column lets us represent string features in a dataset, such as labeling an animal as a dog or a cat. The data is encoded as a one-hot vector: with the vocabulary {cat, dog}, "cat" becomes [1, 0] and "dog" becomes [0, 1]. A dense feature, by contrast, can hold any real number, not just 0 or 1, and can be produced from both numeric columns and categorical columns.
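The one-hot encoding described above can be sketched in a few lines (the cat/dog vocabulary is just the illustrative example from the text):

```python
# Minimal sketch of how a categorical column becomes a one-hot vector.
vocab = ["cat", "dog"]

def one_hot(value, vocabulary):
    """Return a vector with 1 at the value's vocabulary index and 0 elsewhere."""
    return [1 if v == value else 0 for v in vocabulary]

print(one_hot("cat", vocab))  # [1, 0]
print(one_hot("dog", vocab))  # [0, 1]
```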
The feature columns fed into LinearClassifier() are sex, number of siblings and spouses, parch, class, deck, embark town, alone, age, and fare. Age and fare are numeric columns; the rest are categorical. The initial run produced decent results: an accuracy of 0.754, a loss of 0.467, and an average loss of 0.474. Since the base feature columns alone were not enough to produce the best results, a crossed feature column was added. This matters because particular feature combinations may correlate with the label in ways the individual features do not capture. In this model, the crossed column combines age and gender, attempting to find a relationship between the two and their joint effect on the label. The resulting output had an accuracy of 0.761, a loss of 0.457, and an average loss of 0.466. Although the improvement is small, these results already beat the previous ones with just one crossed feature column. The predicted-probability histograms and ROC curves show this visually.
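The idea behind a crossed feature column (the same concept as tf.feature_column.crossed_column) can be sketched without TensorFlow: bucketize age, pair the bucket with sex, and hash the pair into a fixed number of sparse ids. The boundaries and bucket count here are illustrative, not the model's actual settings:

```python
import zlib

# Illustrative age boundaries and hash-bucket count (assumptions, not the
# values used in the report's model).
AGE_BOUNDARIES = [18, 35, 50]
HASH_BUCKETS = 100

def age_bucket(age):
    """Index of the first boundary that age falls below (last bucket otherwise)."""
    for i, bound in enumerate(AGE_BOUNDARIES):
        if age < bound:
            return i
    return len(AGE_BOUNDARIES)

def crossed_feature(age, sex):
    """Deterministically map the (age bucket, sex) pair to a sparse id."""
    key = f"{age_bucket(age)}_x_{sex}".encode()
    return zlib.crc32(key) % HASH_BUCKETS

# Each (age bucket, sex) combination gets its own id, so the linear model can
# learn a separate weight for, e.g., young females versus young males.
print(crossed_feature(25, "female"))
print(crossed_feature(25, "male"))
```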
Both predicted-probability histograms show a higher frequency of probabilities near zero, with the distributions skewed to the right. However, the frequencies around 0.5 are lower than those closer to one. This suggests two groups of passengers: those who are very likely to die, and those with a decent chance of living. Since the sex of the passenger is the most important feature in this dataset, the plots appear to reflect that fact, with the low-survival group being the males and the higher-survival group being the females.
In the ROC graphs, the second outperforms the first, as it has a better ratio of true positives to false positives: its curve sits closer to the top-left corner and farther from the diagonal. That means a greater area under the curve and therefore better performance. Although subtle, the higher true positive rates are visible in the second graph; for example, at a false positive rate of about 0.1 the true positive rate spikes to roughly 0.6, versus only about 0.55 in the first graph, further illustrating the effect of the crossed feature column.
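For reference, an ROC curve and its AUC can be computed directly from predicted probabilities. A small sketch with made-up scores and labels (not the model's actual outputs):

```python
# Illustrative predicted survival probabilities and true labels (1 = survived).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   0,   1,   0,   0]

def roc_points(scores, labels):
    """(false positive rate, true positive rate) at every score threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = [(0.0, 0.0)]
    tp = fp = 0
    # Sweep the decision threshold from the highest score downward.
    for _, y in sorted(zip(scores, labels), reverse=True):
        if y == 1:
            tp += 1
        else:
            fp += 1
        pts.append((fp / neg, tp / pos))
    return pts

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

print(round(auc(roc_points(scores, labels)), 3))  # 0.812
```

A curve hugging the top-left corner accumulates true positives before false positives, which is exactly what pushes this area toward 1.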