Variables section – Exploring Data for Machine Learning

Let us walk through the Variables section in the report.

Under the Variables section, we can select any variable in the dataset under the dropdown and see the statistical information about the dataset, such as the number of unique values for that variable, missing values for that variable, the size of that variable, and so on.

In the following figure, we selected the age variable in the dropdown and can see the statistics about that variable:

Figure 1.28 – Variables

Interactions section

As shown in the following figure, this report also contains the Interactions plot to show how one variable relates to another variable:

Figure 1.29 – Interactions

Correlations

Now, let’s see the Correlations section in the report; we can see the correlation between various variables in Heatmap. Also, we can see various correlation coefficients in the Table form.

Figure 1.30 – Correlations

Heatmaps use color intensity to represent values. The colors typically range from cool to warm hues, with cool colors (e.g., blue or green) indicating low values and warm colors (e.g., red or orange) indicating high values. Rows and columns of the matrix are represented on both the x axis and y axis of the heatmap. Each cell at the intersection of a row and column represents a specific value in the data.

The color intensity of each cell corresponds to the magnitude of the value it represents. Darker colors indicate higher values, while lighter colors represent lower values.

As we can see in the preceding figure, the intersection cell between income and hours per week shows a high-intensity blue color, which indicates there is a high correlation between income and hours per week. Similarly, the intersection cell between income and capital gain shows a high-intensity blue color, indicating a high correlation between those two features.

Missing values

This section of the report shows the counts of total values present within the data and provides a good understanding of whether there are any missing values.

Under Missing values, we can see two tabs:

  • The Count plot
  • The Matrix plot

In Figure 1.31, the shows that all variables have a count of 32,561, which is the count of rows (observations) in the dataset. That indicates that there are no missing values in the dataset.

Figure 1.31 – Missing values count

Matrix plot

The following Matrix plot indicates where the missing values are (if there are any missing values in the dataset):

Figure 1.32 – Missing values matrix

Sample data

This section shows the sample data for the first 10 rows and the last 10 rows in the dataset.

Figure 1.33 – Sample data

This section shows the most frequently occurring rows and the number of duplicates in the dataset.

Figure 1.34 – Duplicate rows

We have seen how to analyze the data using Pandas and then how to visualize the data by plotting various plots such as bar charts and histograms using sns, seaborn, and pandas-ydata-profiling. Next, let us see how to perform data analysis using OpenAI LLM and the LangChain Pandas Dataframe agent by asking questions with natural language.

Leave a Reply

Your email address will not be published. Required fields are marked *