Why is Snorkel popular? – Labeling Data for Classification

Snorkel is used by several large organizations, such as Google, Apple, Stanford Medicine, and Tide, and it has been shown to solve business problems at scale. For example, according to the snorkel.ai website, Google used Snorkel to replace 100K+ hand-annotated labels in critical ML pipelines for text classification. In another example, researchers at Stanford Medicine used Snorkel to label medical imaging and monitoring datasets, turning years of hand-labeling work into a job of several hours. Apple built a solution called Overton that uses Snorkel’s weak supervision framework to overcome cost, privacy, and cold-start issues. Tide, a UK-based fintech company, used Snorkel to label invoices matched with transactions, a task that would otherwise have required expensive subject matter experts to hand-label historical data. Excerpts of these case studies and more can be found at https://snorkel.ai/case-studies.

Snorkel creates weak labels using business rules and patterns in the absence of labeled data. Using Snorkel requires less development effort than crowdsourced manual labeling. The cost of development also comes down because labeling is automated with Python code instead of relying on expensive business domain experts to label every record by hand.

The weak labels produced by the Snorkel label model can be used to train a machine learning classifier for classification and information extraction tasks. In the following sections, we are going to see the step-by-step process of generating weak labels using Snorkel. To begin, let’s load some mostly unlabeled data from a CSV file.

In the following example, we use an adult income dataset in which only a limited number of rows are labeled.

In this adult income dataset, we have the income label and features such as age, working hours, education, and work class. The income label class is available only for a few observations. The majority of observations are unlabeled without any value for income. So, our goal is to generate the labels using the Snorkel Python library.

We will come up with business rules after analyzing the correlation between each individual feature and the income label class. We will then use the Snorkel Python library to create one labeling function for each business rule. Finally, we will generate labels for the unlabeled rows of the income dataset by applying those labeling functions.
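As a rough illustration of the analysis step described above, the sketch below measures how features relate to the income label on the few labeled rows. The toy data, the column names (`education`, `hours_per_week`), and the label strings (`>50K`, `<=50K`) are assumptions made for this example, not values taken from the actual dataset.

```python
import pandas as pd

# Toy stand-in for the adult income data; all values are invented for illustration
toy_df = pd.DataFrame({
    "education": ["Masters", "HS-grad", "Masters", "HS-grad", "Bachelors", "Masters"],
    "hours_per_week": [50, 35, 45, 40, 40, 60],
    "income": [">50K", "<=50K", ">50K", "<=50K", None, None],
})

# Only rows that already have a label can tell us how features relate to income
labeled = toy_df.dropna(subset=["income"])
labeled = labeled.assign(high_income=(labeled["income"] == ">50K").astype(int))

# Categorical feature: share of high-income rows within each education level
share_by_education = labeled.groupby("education")["high_income"].mean()

# Numeric feature: Pearson correlation with the binary income indicator
hours_corr = labeled["hours_per_week"].corr(labeled["high_income"])

print(share_by_education)
print(f"hours_per_week vs. income correlation: {hours_corr:.2f}")
```

A feature level with a share close to 0 or 1, or a numeric feature with a strong correlation, is a candidate for a business rule. On a real dataset, the thresholds should be checked against far more labeled rows than this toy sample contains.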

Loading unlabeled data

We will first load the adult income dataset for predicting weak labels. Let us load the data using Pandas into a DataFrame:
# Load the dataset
import pandas as pd
df = pd.read_csv("<YourPath>/adult_income.csv", encoding="latin-1")

Here, don’t forget to replace <YourPath> with the path in your system. The target column name is income. Let us make sure the data is loaded correctly:
df.head()

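Let us separate the input features from the output label:
x = df.iloc[:, :-1]  # features: every column except the last (assumes income is the last column)
y = df["income"]     # target: the income label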

In this section, we have loaded the income dataset from a CSV file into a Pandas DataFrame.
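Because only a few observations carry an income value, it can help to check how many rows are labeled before writing any rules. The following is a minimal sketch on invented toy data. It assumes that missing labels are read as NaN, which is how `pd.read_csv` treats empty fields by default; the real file may encode missing labels differently.

```python
import pandas as pd

# Toy stand-in for the loaded DataFrame; values are invented for illustration
toy_df = pd.DataFrame({
    "age": [25, 47, 33, 52, 29],
    "income": [">50K", None, None, "<=50K", None],
})

# A row counts as labeled when its income value is present
is_labeled = toy_df["income"].notna()
labeled_df = toy_df[is_labeled]
unlabeled_df = toy_df[~is_labeled]

print(f"labeled: {len(labeled_df)}, unlabeled: {len(unlabeled_df)}")
```

The labeled subset is what the rules are derived from, and the unlabeled subset is what the labeling functions will later be applied to.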

Creating the labeling functions

Labeling functions implement the rules that are used to create the label model. Let us import the following decorator, which is required to implement the labeling functions:
from snorkel.labeling import labeling_function

We need at least four labeling functions to build an accurate generative label model. To create the labeling functions, let’s first define the labeling rules based on the small set of labels that is available. If no labeled data is available, subject matter expert knowledge is required to define the rules for labeling. Labeling functions use the following labeling rules for the classification of income range for the given adult income dataset.
