
SCIKIT_LLM (sklearn + llm), use large language models easily!!!

Dan-k 2023. 7. 31. 13:00

SCIKIT_LLM

  • A tool that makes OpenAI's LLM models easy to use
    • sklearn + llm (large language model)
  • LLM models can be used through the familiar sklearn fit/predict workflow
  • Advantage of using LLM models → the usual text vectorization and preprocessing steps can be skipped!!!
    • See the examples below!!! (a minimal quick-start sketch follows right after this list)
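
For reference, a minimal quick-start sketch showing the install command and the sklearn-style fit/predict pattern (it assumes the API key setup covered in the next section; the example texts here are made up):

# pip install scikit-llm

from skllm import ZeroShotGPTClassifier

# raw texts go in directly -- no manual vectorization step
X = ["The food was great", "The service was terrible"]
y = ["positive", "negative"]

clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")
clf.fit(X, y)
print(clf.predict(["Not bad at all"]))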

Configuring OpenAI API Key

  • Scikit-LLM estimators require an OpenAI API key
from skllm.config import SKLLMConfig

SKLLMConfig.set_openai_key("<YOUR_KEY>")
SKLLMConfig.set_openai_org("<YOUR_ORGANISATION>")
  • With a free-tier account there is a rate limit of 3 requests per minute!! (rate limits)
  • SKLLMConfig.set_openai_org takes the organization ID (NOT the name); the ID can be found in your OpenAI account's organization settings (a sketch of reading the credentials from environment variables follows below)
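
A minimal sketch of reading the credentials from environment variables instead of hard-coding them (OPENAI_API_KEY and OPENAI_ORG_ID are hypothetical variable names, not something the library requires):

import os

from skllm.config import SKLLMConfig

# assumes the key and org ID were exported beforehand, e.g.
#   export OPENAI_API_KEY="sk-..."
#   export OPENAI_ORG_ID="org-..."
SKLLMConfig.set_openai_key(os.environ["OPENAI_API_KEY"])
SKLLMConfig.set_openai_org(os.environ["OPENAI_ORG_ID"])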

Using Azure OpenAI

from skllm.config import SKLLMConfig

SKLLMConfig.set_openai_key("<YOUR_KEY>")  # use azure key instead
SKLLMConfig.set_azure_api_base("<API_BASE>")

# start with "azure::" prefix when setting the model name
model_name = "azure::<model_name>"
# e.g. ZeroShotGPTClassifier(openai_model="azure::gpt-3.5-turbo")

Using GPT4ALL

  • To use GPT4All (local models), install the extra dependency as shown below
pip install "scikit-llm[gpt4all]"
  • format:  gpt4all::<model_name>
SKLLMConfig.set_openai_key("any string")
SKLLMConfig.set_openai_org("any string")

ZeroShotGPTClassifier(openai_model="gpt4all::ggml-model-gpt4all-falcon-q4_0.bin")

Estimators supporting non-standard backends

At the moment only the following estimators support non-standard backends (gpt4all, azure):

  • ZeroShotGPTClassifier
  • MultiLabelZeroShotGPTClassifier
  • FewShotGPTClassifier

Zero-Shot Text Classification

  • (Powerful feature) Text classification problems can be solved just by calling ChatGPT, with no model retraining!! → no fine-tuning or additional training is needed
  • Zero-shot means the model can decide the label on its own even without being shown labeled examples for every class → like meta-learning, it also exploits the similarity between the data and the labels
    • e.g. give it only the labels "cow" and "horse" and it can still tell a zebra apart
  • A ZeroShotGPTClassifier class is provided that lets you create such a model as a regular scikit-learn classifier.

Example 1: Training as a regular classifier

from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

# demo sentiment analysis dataset
# labels: positive, negative, neutral
X, y = get_classification_dataset()

clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")
clf.fit(X, y)
labels = clf.predict(X)

# ---

movie_reviews = [
    "This movie was absolutely wonderful. The storyline was compelling and the characters were very realistic.",
    "I really loved the film! The plot had a few unexpected twists which kept me engaged till the end.",
    "The movie was alright. Not great, but not bad either. A decent one-time watch.",
    "I didn't enjoy the film that much. The plot was quite predictable and the characters lacked depth.",
    "This movie was not to my taste. It felt too slow and the storyline wasn't engaging enough.",
    "The film was okay. It was neither impressive nor disappointing. It was just fine.",
    "I was blown away by the movie! The cinematography was excellent and the performances were top-notch.",
    "I didn't like the movie at all. The story was uninteresting and the acting was mediocre at best.",
    "The movie was decent. It had its moments but was not consistently engaging."
]

movie_review_labels = [
    "positive", 
    "positive", 
    "neutral", 
    "negative", 
    "negative", 
    "neutral", 
    "positive", 
    "negative", 
    "neutral"
]

new_movie_reviews = [
    # A positive review
    "The movie was fantastic! I was captivated by the storyline from beginning to end.",

    # A negative review
    "I found the film to be quite boring. The plot moved too slowly and the acting was subpar.",

    # A neutral review
    "The movie was okay. Not the best I've seen, but certainly not the worst."
]

clf.fit(X=movie_reviews, y=movie_review_labels)  

# Use the trained classifier to predict the sentiment of the new reviews
predicted_movie_review_labels = clf.predict(X=new_movie_reviews)  

for review, sentiment in zip(new_movie_reviews, predicted_movie_review_labels):
    print(f"Review: {review}\\nPredicted Sentiment: {sentiment}\\n\\n")

# result
Review: The movie was fantastic! I was captivated by the storyline from beginning to end.
Predicted Sentiment: positive

Review: I found the film to be quite boring. The plot moved too slowly and the acting was subpar.
Predicted Sentiment: negative

Review: The movie was okay. Not the best I've seen, but certainly not the worst.
Predicted Sentiment: neutral
  • Scikit-LLM automatically sends the OpenAI API request and converts the response into the label output

Example 2: Training without labeled data

  • You can also fit with only the candidate labels, without any labeled data
from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

X, _ = get_classification_dataset()

clf = ZeroShotGPTClassifier()
clf.fit(None, ["positive", "negative", "neutral"])
labels = clf.predict(X)
  • Unlike conventional supervised learning, the zero-shot classifier assigns labels based on the semantic meaning of the labels themselves

Multi-Label Zero-Shot Text Classification

  • Texts that carry several labels at once can be classified
  • Use MultiLabelZeroShotGPTClassifier

Example:

from skllm import MultiLabelZeroShotGPTClassifier
from skllm.datasets import get_multilabel_classification_dataset

X, y = get_multilabel_classification_dataset()

clf = MultiLabelZeroShotGPTClassifier(max_labels=3)
clf.fit(X, y)
labels = clf.predict(X)

# ---
restaurant_reviews = [
    "The food was delicious and the service was excellent. A wonderful dining experience!",
    "The restaurant was in a great location, but the food was just average.",
    "The service was very slow and the food was cold when it arrived. Not a good experience.",
    "The restaurant has a beautiful ambiance, and the food was superb.",
    "The food was great, but I found it to be a bit overpriced.",
    "The restaurant was conveniently located, but the service was poor.",
    "The food was not as expected, but the restaurant ambiance was really nice.",
    "Great food and quick service. The location was also very convenient.",
    "The prices were a bit high, but the food quality and the service were excellent.",
    "The restaurant offered a wide variety of dishes. The service was also very quick."
]

restaurant_review_labels = [
    ["Food", "Service"],
    ["Location", "Food"],
    ["Service", "Food"],
    ["Atmosphere", "Food"],
    ["Food", "Price"],
    ["Location", "Service"],
    ["Food", "Atmosphere"],
    ["Food", "Service", "Location"],
    ["Price", "Food", "Service"],
    ["Food Variety", "Service"]
]

new_restaurant_reviews = [
    "The food was excellent and the restaurant was located in the heart of the city.",
    "The service was slow and the food was not worth the price.",
    "The restaurant had a wonderful ambiance, but the variety of dishes was limited."
]

from skllm import MultiLabelZeroShotGPTClassifier

# Initialize the classifier with the OpenAI model
clf = MultiLabelZeroShotGPTClassifier(max_labels=3)

# Train the model 
clf.fit(X=restaurant_reviews, y=restaurant_review_labels)

# Use the trained classifier to predict the labels of the new reviews
predicted_restaurant_review_labels = clf.predict(X=new_restaurant_reviews)

for review, labels in zip(new_restaurant_reviews, predicted_restaurant_review_labels):
    print(f"Review: {review}\\nPredicted Labels: {labels}\\n\\n")

# result
Review: The food was excellent and the restaurant was located in the heart of the city.
Predicted Labels: ['Food', 'Location']

Review: The service was slow and the food was not worth the price.
Predicted Labels: ['Service', 'Price']

Review: The restaurant had a wonderful ambiance, but the variety of dishes was limited.
Predicted Labels: ['Atmosphere', 'Food Variety']

As with zero-shot classification, it can also be used without labeled data, working from the label meanings alone -> only the set of candidate labels (y) is provided

from skllm import MultiLabelZeroShotGPTClassifier
from skllm.datasets import get_multilabel_classification_dataset

X, _ = get_multilabel_classification_dataset()
candidate_labels = [
    "Quality",
    "Price",
    "Delivery",
    "Service",
    "Product Variety",
    "Customer Support",
    "Packaging",
    "User Experience",
    "Return Policy",
    "Product Information",
]
clf = MultiLabelZeroShotGPTClassifier(max_labels=3)
clf.fit(None, [candidate_labels])
labels = clf.predict(X)

Few-Shot Text Classification

- A labeled training dataset is required

- The training set can be small: roughly 10 examples per label is enough

  (this works because the underlying model has already been pre-trained on a sufficiently large corpus)

from skllm import FewShotGPTClassifier
from skllm.datasets import get_classification_dataset

X, y = get_classification_dataset()

clf = FewShotGPTClassifier(openai_model="gpt-3.5-turbo")
clf.fit(X, y)
labels = clf.predict(X)

- The model is not actually retrained; the training examples are simply used at inference time, ending up inside the prompt (a conceptual sketch follows below) -> somewhat similar to the zero-shot approach
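
A purely conceptual sketch of the idea that the labeled examples end up inside the prompt at prediction time (this is not the library's actual prompt template):

# conceptual illustration only -- scikit-llm builds its own prompt internally
def build_few_shot_prompt(train_texts, train_labels, query):
    lines = ["Classify the text into one of the labels seen below.", ""]
    for text, label in zip(train_texts, train_labels):
        lines.append(f"Text: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(f"Text: {query}")
    lines.append("Label:")
    return "\n".join(lines)

print(build_few_shot_prompt(
    ["The movie was fantastic!", "The plot was dull."],
    ["positive", "negative"],
    "A decent one-time watch.",
))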

Dynamic Few-Shot Text Classification

- Dynamic few-shot dynamically selects N examples per class, which lets the approach scale to larger datasets while keeping the prompt within the LLM's context window

- During fitting, the whole dataset is partitioned by class, vectorized, and stored

- During inference, the annoy library is used for fast nearest-neighbor lookup, so that only the most similar examples are included in the prompt (an illustrative sketch follows the code below)

from skllm import DynamicFewShotGPTClassifier
from skllm.datasets import get_classification_dataset

X, y = get_classification_dataset()

clf = DynamicFewShotGPTClassifier(n_examples=3)
clf.fit(X, y)
labels = clf.predict(X)
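
To illustrate the mechanism described above, here is an independent sketch of per-class nearest-neighbor example selection with annoy (TF-IDF stands in for the embeddings; this is not the library's internal code):

from annoy import AnnoyIndex
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = [
    "The movie was fantastic",
    "Great acting and story",
    "The plot was boring",
    "I fell asleep halfway through",
]
train_labels = ["positive", "positive", "negative", "negative"]
query = "The acting was fantastic and the story was great"

# vectorize (a stand-in for the embeddings the library would use)
vec = TfidfVectorizer()
X_vec = vec.fit_transform(train_texts).toarray()
q_vec = vec.transform([query]).toarray()[0]

# build one annoy index per class, matching the "partitioned by class" description
n_examples = 1
selected = []
for label in set(train_labels):
    idxs = [i for i, l in enumerate(train_labels) if l == label]
    index = AnnoyIndex(X_vec.shape[1], "angular")
    for j, i in enumerate(idxs):
        index.add_item(j, X_vec[i])
    index.build(10)
    # keep only the most similar example(s) of this class for the prompt
    for j in index.get_nns_by_vector(q_vec, n_examples):
        selected.append((train_texts[idxs[j]], label))

print(selected)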

Text Classification with Google PaLM 2

At the moment, 3 PaLM-based models are available in test mode:

  • ZeroShotPaLMClassifier - zero-shot text classification with PaLM 2;
  • PaLMClassifier - fine-tunable text classifier with PaLM 2;
  • PaLM - fine-tunable estimator that can be trained on arbitrary text input-output pairs.
from skllm.models.palm import PaLMClassifier
from skllm.datasets import get_classification_dataset

X, y = get_classification_dataset()

clf = PaLMClassifier(n_update_steps=100)
clf.fit(X, y)
labels = clf.predict(X)

More detailed documentation will follow soon; for now, refer to the official scikit-llm guide on Medium.

Text Vectorization

- The text embedding step itself can also be used on its own

Example 1: Embedding the text

from skllm.preprocessing import GPTVectorizer

# X is a list of raw texts (e.g. from get_classification_dataset())
model = GPTVectorizer()
vectors = model.fit_transform(X)

Example 2: Combining the Vectorizer with the XGBoost Classifier in a Sklearn Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

from skllm.preprocessing import GPTVectorizer

# X_train, X_test, y_train, y_test are assumed to be a pre-made train/test split
le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)
y_test_encoded = le.transform(y_test)

steps = [("GPT", GPTVectorizer()), ("Clf", XGBClassifier())]
clf = Pipeline(steps)
clf.fit(X_train, y_train_encoded)
yh = clf.predict(X_test)
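
One way to produce the train/test split assumed above, using the demo dataset from earlier (a sketch; any labeled text dataset works):

from sklearn.model_selection import train_test_split

from skllm.datasets import get_classification_dataset

X, y = get_classification_dataset()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)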

 

Text Summarization

- The summarization feature can also be used!

 

- It appears to be abstractive rather than extractive summarization

- A word or concept to focus on can also be specified (see the sketch after the result below)

reviews = [
    "I dined at The Gourmet Kitchen last night and had a wonderful experience. The service was impeccable, the food was exquisite, and the ambiance was delightful. I had the seafood pasta, which was cooked to perfection. The wine list was also quite impressive. I would highly recommend this restaurant to anyone looking for a fine dining experience.",
    "I visited The Burger Spot for lunch today and was pleasantly surprised. Despite being a fast food joint, the quality of the food was excellent. I ordered the classic cheeseburger and it was juicy and flavorful. The fries were crispy and well-seasoned. The service was quick and the staff was friendly. It's a great place for a quick and satisfying meal.",
    "The Coffee Corner is my favorite spot to work and enjoy a good cup of coffee. The atmosphere is relaxed and the coffee is always top-notch. They also offer a variety of pastries and sandwiches. The staff is always welcoming and the service is fast. I enjoy their latte and the blueberry muffin is a must-try."
]

from skllm.preprocessing import GPTSummarizer

# Initialize the GPT summarizer model
gpt_summarizer = GPTSummarizer(openai_model="gpt-3.5-turbo", max_words=15)

summaries = gpt_summarizer.fit_transform(reviews)

# result
['The Gourmet Kitchen offers impeccable service, exquisite food, delightful ambiance, and impressive wine list. Highly recommended.'
 'The Burger Spot offers excellent quality fast food with friendly service.'
 'The Coffee Corner is a great place to work with good coffee and food.']
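
As for the focus option mentioned above: newer versions of GPTSummarizer are documented to accept a focus parameter, so usage would presumably look like the sketch below (treat the parameter as an assumption and check the version you have installed):

# assumes a scikit-llm version where GPTSummarizer accepts `focus`
focused_summarizer = GPTSummarizer(
    openai_model="gpt-3.5-turbo", max_words=15, focus="food"
)
focused_summaries = focused_summarizer.fit_transform(reviews)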

Text Translation

- Translation is also supported!!

from skllm.preprocessing import GPTTranslator
from skllm.datasets import get_translation_dataset

X = get_translation_dataset()
t = GPTTranslator(openai_model="gpt-3.5-turbo", output_language="English")
translated_text = t.fit_transform(X)

 

References

https://github.com/iryna-kondr/scikit-llm

 


https://mlengineeringplace.com/scikitllm-a-powerful-combination-of-sklearn-and-llms/

https://medium.com/@fareedkhandev/scikit-llm-sklearn-meets-large-language-models-11fc6f30e530

https://ealizadeh.com/blog/tutorial-scikit-llm/

 
