In summary, the line cluster_mask = (kmeans_model.labels_ == cluster_idx) creates a Boolean mask that selects the data points assigned to a specific cluster based on their cluster labels. The mask can then be used to perform various operations on the points belonging to that cluster, such as computing the cluster’s average price. Predicted prices are assigned to the unlabeled data based on these per-cluster average prices:
# Assign predicted prices to unlabeled data
predicted_prices = np.array([cluster_avg_prices[cluster] \
    for cluster in unlabelled_clusters])
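For reference, here is a minimal sketch of how cluster_avg_prices and unlabelled_clusters could have been computed using the mask described above, assuming the K-means model was fitted on the labeled samples and that labeled_prices holds their prices (an illustrative name, not a variable defined earlier):
# Sketch: mean price of the labeled points in each cluster
cluster_avg_prices = {}
for cluster_idx in range(kmeans_model.n_clusters):
    # Select the labeled points assigned to this cluster
    cluster_mask = (kmeans_model.labels_ == cluster_idx)
    cluster_avg_prices[cluster_idx] = \
        labeled_prices[cluster_mask].mean()
# Cluster assignment for each unlabeled sample
unlabelled_clusters = kmeans_model.predict(unlabeled_data)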
Finally, we display the predicted prices that the K-means clustering technique produced for the unlabeled data, providing insight into the potential housing prices for those samples:
print("Predicted Prices for Unlabelled Data:", predicted_prices)
Here’s the output for the predicted prices:

Figure 3.7 – Predicted price for unlabeled data based on the mean value of the labeled data cluster
As shown in the output, K-means clustering lets us predict house prices for unlabeled data when labeled training data is scarce. We can then combine the newly labeled dataset with the original training dataset and fit a regression model on the combined data:
# Combine the labeled and newly predicted data
new_data = pd.concat([labeled_data, unlabeled_data], ignore_index=True)
new_data['price'] = pd.concat([train_labels, \
    pd.Series(predicted_prices)], ignore_index=True)
# Train a new model on the combined data
new_train_data, new_test_data, new_train_labels, new_test_labels = \
    train_test_split(new_data.drop('price', axis=1), \
    new_data['price'], test_size=0.2)
new_regressor = LinearRegression()
new_regressor.fit(new_train_data, new_train_labels)
# Evaluate the performance of the new model on the test data
score = new_regressor.score(new_test_data, new_test_labels)
print("R^2 Score:", score)
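To put this score in context, it can be helpful to compare it against a baseline model trained only on the original labeled data. The following is a minimal sketch; the split variable names here are illustrative, not variables defined earlier in the chapter:
# Baseline: train and evaluate on the original labeled data only
base_train, base_test, base_train_labels, base_test_labels = \
    train_test_split(labeled_data, train_labels, test_size=0.2)
baseline_regressor = LinearRegression()
baseline_regressor.fit(base_train, base_train_labels)
base_score = baseline_regressor.score(base_test, base_test_labels)
print("Baseline R^2 Score:", base_score)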
Overall, we have seen how clustering can be used in unsupervised learning to generate labels for unlabeled data. By computing the average label value for each cluster, we can effectively assign labels to the unlabeled data points based on their similarity to the labeled data.
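Before we wrap up, here is a compact, self-contained sketch of the entire flow on synthetic data; all names and values are illustrative rather than the chapter’s housing dataset:
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic labeled features and prices, plus unlabeled features
labeled_X = rng.normal(size=(100, 3))
labeled_y = labeled_X @ np.array([3.0, 1.0, 2.0]) + 50
unlabeled_X = rng.normal(size=(20, 3))
# Fit K-means on the labeled features
kmeans = KMeans(n_clusters=5, n_init=10, \
    random_state=0).fit(labeled_X)
# Mean price per cluster, using the Boolean-mask pattern
cluster_avg = {c: labeled_y[kmeans.labels_ == c].mean() \
    for c in range(kmeans.n_clusters)}
# Predict each unlabeled point's price from its cluster's mean
clusters = kmeans.predict(unlabeled_X)
pseudo_prices = np.array([cluster_avg[c] for c in clusters])
print("Predicted Prices for Unlabelled Data:", pseudo_prices)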
Summary
In this chapter, we have explored a range of techniques to tackle the challenge of data labeling in regression tasks. We began by delving into the power of summary statistics, harnessing the mean of each feature in the labeled dataset to predict labels for unlabeled data. This technique not only simplifies the labeling process but also lays a foundation for accurate predictions.
Further enriching our labeling arsenal, we ventured into semi-supervised learning, leveraging a small set of labeled data to generate pseudo-labels. The amalgamation of genuine and pseudo-labels in model training not only extends our labeled data but also equips our models to make more informed predictions for unlabeled data.
Data augmentation has emerged as a vital tool in enhancing regression data. Techniques such as scaling and noise injection have breathed new life into our dataset, providing varied instances that empower models to discern patterns better and boost prediction accuracy.
The utilization of k-means clustering rounded off our exploration, as we ventured into grouping data into clusters and assigning labels based on cluster mean values. This approach not only saves time but also bolsters the prediction precision of our models.
The key takeaways from this chapter are that summary statistics simplify data labeling by leveraging means and distances. Semi-supervised learning merges genuine and pseudo-labels for comprehensive training. Data augmentation techniques such as scaling and noise addition enrich and diversify datasets. K-means clustering optimizes labeling by grouping data into clusters and assigning cluster-wide mean labels.
These acquired skills give our regression models resilience and versatility, equipping them to handle real-world, unlabeled data effectively. In the next chapter, we’ll delve into exploratory data analysis of image data in machine learning.