데이터과학 삼학년

[Labeling] Introduction to Snorkel

Dan-k 2020. 11. 20. 15:38

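Introduction to Snorkel

Overview

Using ML to solve the problems we face has become routine.
In practice, supervised learning tends to be easier to apply effectively than unsupervised learning.

Labels make it easier not only to train a model but also to evaluate the trained model.
Supervised learning, however, ultimately requires labeled data.

Snorkel is a library that helps with labeling data.
Where hand-labeling would take weeks or months, Snorkel can help build a large training dataset in hours or days.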

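Key features

Labeling data

Heuristic rules / distant supervision techniques

Transforming data

Generating data through transformations such as rotating or stretching images (data augmentation)

Slicing data

Splitting data into subsets (slices) for targeted analysis

Understanding through an example

Workflow steps (from the get-started quick example)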

Writing Labeling Functions (LFs)

Write functions that assign labels.
Rather than labeling each data point by hand, we define rules that assign labels.

Modeling & Combining LFs

Use Snorkel's LabelModel to automatically learn the accuracies of the LFs (labeling functions) we defined.
It then reweights the LFs
and combines their outputs into a single confidence-weighted training label per data point.

Writing Transformation Functions (TFs) for Data Augmentation

Write transformation functions and use them to augment the data.

Writing Slicing Functions (SFs) for Data Subset Selection

Use slicing functions to identify critical subsets (slices) of the training set.
These are the subsets that deserve separate attention when monitoring or improving the model.

Training a final ML model

Train an ML model on the resulting data.

1. Writing Labeling Functions (LFs)

LFs are the core step: they programmatically assign labels to subsets of the training data.
For each data point, an LF returns 1 for spam, 0 for not spam, or -1 to abstain (decline to assign a label).

# Define the label mappings for convenience
ABSTAIN = -1
NOT_SPAM = 0
SPAM = 1

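A labeling function can encode a heuristic rule or any other noisy strategy for labeling data.
The main purpose of LFs is to inject the user's domain knowledge into Snorkel.
The key idea is that LFs do not need to be perfectly accurate, and it is fine if they are correlated with one another.
Snorkel automatically estimates the accuracies and correlations of the LFs, then reweights and combines their outputs into a high-quality training dataset.
LFs can be written in many ways, as shown below.
Keyword matches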

from snorkel.labeling import labeling_function

@labeling_function()
def lf_keyword_my(x):
    """Many spam comments talk about 'my channel', 'my video', etc."""
    return SPAM if "my" in x.text.lower() else ABSTAIN

Regular expression

import re

@labeling_function()
def lf_regex_check_out(x):
    """Spam comments say 'check out my video', 'check it out', etc."""
    return SPAM if re.search(r"check.*out", x.text, flags=re.I) else ABSTAIN

Arbitrary heuristics

@labeling_function()
def lf_short_comment(x):
    """Non-spam comments are often short, such as 'cool video!'."""
    return NOT_SPAM if len(x.text.split()) < 5 else ABSTAIN

Third-party models

from textblob import TextBlob

@labeling_function()
def lf_textblob_polarity(x):
    """
    We use a third-party sentiment classification model, TextBlob.
    We combine this with the heuristic that non-spam comments are often positive.
    """
    return NOT_SPAM if TextBlob(x.text).sentiment.polarity > 0.3 else ABSTAIN

As the examples above show, we can create multiple labeling functions.
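To see what these functions return without installing Snorkel, here is a minimal sketch that re-implements two of the heuristics above as plain Python functions (no `@labeling_function` decorator) and applies them to a few made-up comments. The `Comment` dataclass, the sample texts, and the `outputs` and `coverage` names are assumptions for illustration only.

```python
import re
from dataclasses import dataclass

ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1


@dataclass
class Comment:
    """Hypothetical stand-in for one row of the comment data."""
    text: str


def lf_regex_check_out(x):
    # Same rule as above, written as a plain function.
    return SPAM if re.search(r"check.*out", x.text, flags=re.I) else ABSTAIN


def lf_short_comment(x):
    # Same rule as above, written as a plain function.
    return NOT_SPAM if len(x.text.split()) < 5 else ABSTAIN


comments = [
    Comment("Check out my new channel please everyone"),
    Comment("cool video!"),
    Comment("I really enjoyed the second half of this song"),
]
lfs = [lf_regex_check_out, lf_short_comment]

# One row per comment, one column per LF.
outputs = [[lf(c) for lf in lfs] for c in comments]

# Fraction of comments on which each LF did not abstain.
coverage = [
    sum(row[j] != ABSTAIN for row in outputs) / len(outputs)
    for j in range(len(lfs))
]

print(outputs)   # [[1, -1], [-1, 0], [-1, -1]]
print(coverage)
```

Each LF votes on only one of the three comments here, and the last comment gets no vote at all, which is why several complementary LFs are usually needed.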

2. Combining & Cleaning the Labels

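In this step, the labeling functions are applied to the unlabeled data.

The output is a label matrix: each row corresponds to a data point, and each column holds the output of one labeling function.

Because the LFs have unknown accuracies and correlations, their output labels may overlap or conflict.
Snorkel's LabelModel automatically estimates those accuracies and correlations, reweights the LF outputs, and combines them into a single label per data point.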
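To make overlapping and conflicting votes concrete, the sketch below combines a small hand-written label matrix by simple majority vote. This is a deliberately simplified stand-in, not the algorithm LabelModel uses: it weights every LF equally, whereas LabelModel learns a weight per LF. The matrix values and the `majority_vote` helper are assumptions for illustration.

```python
from collections import Counter

ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1


def majority_vote(row):
    """Return the most common non-abstain vote, or ABSTAIN on no votes or a tie."""
    votes = Counter(v for v in row if v != ABSTAIN)
    if not votes:
        return ABSTAIN
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


# Rows are data points, columns are LFs (hypothetical values).
L = [
    [SPAM, SPAM, ABSTAIN, ABSTAIN],        # two LFs agree (overlap)
    [SPAM, ABSTAIN, NOT_SPAM, NOT_SPAM],   # conflict, NOT_SPAM wins 2 to 1
    [SPAM, ABSTAIN, NOT_SPAM, ABSTAIN],    # conflict that ends in a tie
    [ABSTAIN, ABSTAIN, ABSTAIN, ABSTAIN],  # no LF voted
]

labels = [majority_vote(row) for row in L]
print(labels)  # [1, 0, -1, -1]
```

A learned weighting can break the tie in the third row when one LF is estimated to be more accurate than the other, which is the advantage a label model has over equal-weight voting.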

from snorkel.labeling.model import LabelModel
from snorkel.labeling import PandasLFApplier


# Define the set of labeling functions (LFs)
lfs = [lf_keyword_my, lf_regex_check_out, lf_short_comment, lf_textblob_polarity]

# Apply the LFs to the unlabeled training data
applier = PandasLFApplier(lfs)

L_train = applier.apply(df_train)


# Train the label model and compute the training labels
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=500, log_freq=50, seed=123)
df_train["label"] = label_model.predict(L=L_train, tie_break_policy="abstain")

For many data points the LFs abstain, so no label can be assigned.

These ABSTAIN rows are removed.

In other words, data points that could not be labeled are dropped from the original data, so some data is lost.

df_train = df_train[df_train.label != ABSTAIN]

3. Writing Transformation Functions for Data Augmentation

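Data augmentation enlarges a training set by creating transformed copies of existing data points; it is a practical way to encode invariances we know about the domain.
For images, augmentation by rotating, flipping, stretching, and shrinking is well known, but techniques for other kinds of data are less established.
Here we introduce a strategy for augmenting text data.
We write a transformation function for the augmentation:

Tokenize the text (the code below simply splits on whitespace rather than running a morphological analysis)
Replace a word with a synonym of the same meaning (the code below picks a random word rather than restricting itself to nouns)
The result is a new data point with the same meaning

df_train shape changes

(1956, 3) → (1387, 4) [after applying LFs and dropping abstains] → (2701, 4) [after data augmentation]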
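As a quick check on what these shapes imply, the arithmetic below uses only the three row counts reported above; the variable names are made up for this sketch.

```python
# Row counts taken from the shapes reported above.
n_raw, n_labeled, n_augmented = 1956, 1387, 2701

dropped = n_raw - n_labeled            # rows lost because no label was assigned
kept_fraction = n_labeled / n_raw      # share of the raw data that survived
added = n_augmented - n_labeled        # augmented copies that were created
copies_per_point = added / n_labeled   # average new copies per labeled row

print(dropped, round(kept_fraction, 3))       # 569 0.709
print(added, round(copies_per_point, 2))      # 1314 0.95
```

So roughly 29% of the raw comments are dropped at the labeling step, and augmentation then adds a little under one new copy per remaining comment.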

import random
import nltk
from nltk.corpus import wordnet as wn
from snorkel.augmentation import transformation_function

nltk.download("wordnet", quiet=True)


def get_synonyms(word):
    """Get the synonyms of word from Wordnet."""
    lemmas = set().union(*[s.lemmas() for s in wn.synsets(word)])
    return list(set(l.name().lower().replace("_", " ") for l in lemmas) - {word})


@transformation_function()
def tf_replace_word_with_synonym(x):
    """Try to replace a random word with a synonym."""
    words = x.text.lower().split()
    idx = random.choice(range(len(words)))
    synonyms = get_synonyms(words[idx])

    if len(synonyms) > 0:
        x.text = " ".join(words[:idx] + [synonyms[0]] + words[idx + 1 :])
        return x
    # Falling through returns None, which signals that no transformation was applied.

Apply the TF with Snorkel's augmentation utilities

from snorkel.augmentation import ApplyOnePolicy, PandasTFApplier

tf_policy = ApplyOnePolicy(n_per_original=2, keep_original=True)
tf_applier = PandasTFApplier([tf_replace_word_with_synonym], tf_policy)

df_train_augmented = tf_applier.apply(df_train)
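The effect of `n_per_original=2` and `keep_original=True` can be sketched in plain Python. This is an approximation of the behaviour under stated assumptions, not Snorkel's implementation: the `augment` helper, the `Comment` dataclass, and the tiny `SYNONYMS` table (standing in for WordNet) are all hypothetical.

```python
import copy
import random
from dataclasses import dataclass


@dataclass
class Comment:
    """Hypothetical stand-in for one labeled row."""
    text: str
    label: int


# Toy synonym table standing in for WordNet (an assumption for this sketch).
SYNONYMS = {"video": ["clip"], "great": ["excellent"]}


def tf_replace_word_with_synonym(x):
    """Replace one random word with a synonym, or return None if it has none."""
    words = x.text.lower().split()
    idx = random.randrange(len(words))
    synonyms = SYNONYMS.get(words[idx], [])
    if synonyms:
        x.text = " ".join(words[:idx] + [synonyms[0]] + words[idx + 1:])
        return x
    return None


def augment(points, tf, n_per_original=2, keep_original=True):
    """Keep each original and try n_per_original transformed copies of it."""
    out = []
    for p in points:
        if keep_original:
            out.append(p)
        for _ in range(n_per_original):
            # Transform a copy so the original row is left intact.
            candidate = tf(copy.deepcopy(p))
            if candidate is not None:
                out.append(candidate)
    return out


data = [Comment("great video", 0), Comment("subscribe now", 1)]
augmented = augment(data, tf_replace_word_with_synonym)
print(len(augmented))  # 4
```

The first comment yields two extra copies because both of its words have synonyms, while the second yields none. Each copy inherits the label of its original, and a transformation that cannot be applied produces fewer rows than the policy's upper bound.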

4. Writing a Slicing Function

Slicing functions pick out the subsets that matter most from a large amount of data.
How Snorkel uses SFs:

monitoring
improving model performance

import re

from snorkel.slicing import slicing_function


@slicing_function()
def short_link(x):
    """Return whether text matches common pattern for shortened ".ly" links."""
    return int(bool(re.search(r"\w+\.ly", x.text)))
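One way to use such a slice for monitoring is to report accuracy on the slice next to overall accuracy. The sketch below does this in plain Python with made-up texts, gold labels, and predictions; the `accuracy` helper and the data are assumptions, and the regex is the one from the slicing function above.

```python
import re


def short_link(text):
    """Same pattern as the slicing function above, as a plain function."""
    return bool(re.search(r"\w+\.ly", text))


# (text, gold label, predicted label), all hypothetical.
examples = [
    ("see bit.ly/abc for free stuff", 1, 1),
    ("win prizes at ow.ly/xyz", 1, 0),
    ("love this song", 0, 0),
    ("great video", 0, 0),
]


def accuracy(rows):
    """Fraction of rows whose prediction matches the gold label."""
    return sum(gold == pred for _, gold, pred in rows) / len(rows)


in_slice = [row for row in examples if short_link(row[0])]

overall_acc = accuracy(examples)
slice_acc = accuracy(in_slice)
print(overall_acc, slice_acc)  # 0.75 0.5
```

The overall number hides that the model is wrong on half of the short-link comments, which is the kind of gap a slice is meant to surface.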

5. Training a Classifier

Use the data built with Snorkel to train a model.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Featurize the augmented comments as unigram and bigram counts
train_text = df_train_augmented.text.tolist()
X_train = CountVectorizer(ngram_range=(1, 2)).fit_transform(train_text)

# Fit a logistic regression classifier on the Snorkel-generated labels
clf = LogisticRegression(solver="lbfgs")
clf.fit(X=X_train, y=df_train_augmented.label.values)
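For intuition about `ngram_range=(1, 2)`, the sketch below counts unigrams and bigrams for one comment in plain Python. It is a simplification, not scikit-learn's tokenizer: it splits on whitespace only, while CountVectorizer applies its own token pattern. The `ngram_counts` helper is hypothetical.

```python
from collections import Counter


def ngram_counts(text, n_min=1, n_max=2):
    """Count all word n-grams of length n_min..n_max in a whitespace-split text."""
    tokens = text.lower().split()
    features = []
    for n in range(n_min, n_max + 1):
        features += [
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        ]
    return Counter(features)


counts = ngram_counts("check out my channel")
print(sorted(counts))
# ['channel', 'check', 'check out', 'my', 'my channel', 'out', 'out my']
```

Including bigrams lets a linear classifier put weight on phrases such as "check out" rather than only on the individual words.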

Source: https://www.snorkel.org/get-started/
