Analyzing text data by converting it into vectors

Word -> vectorize

Bag of Words -> CountVectorizer
tf-idf : term frequency x inverse document frequency -> TfidfVectorizer
Word embedding, SequenceVec

Doc2Vec

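Whatever the task, the simplest and most interpretable approach is best!

BoW : the simplest option -> converts text into a vector of counts (builds a word dictionary, i.e. a vocabulary)

Today's exercise covers only examples in English, which is comparatively easy to process

**[Note]**

Korean morphological analyzers

KoNLPy : Kkma, Mecab, Twitter, Komoran, Hannanum
Kakao : khaiii -> deep-learning based

Performance comparison of morphological analyzers

Reference : https://iostream.tistory.com/144
Reference : http://konlpy.org/ko/v0.4.3/morph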

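[Basic concepts]

#### BoW : converts a text document into a flat vector of word counts; it keeps no sequence information

It only records how many times each word appears

#### That is, it forms an n-dimensional vector, where n is the number of distinct words in the vocabulary

It can destroy the meaning of a sentence and cannot take the surrounding context into account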
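The loss of word order can be checked with a few lines of plain Python. This is a minimal sketch that uses `collections.Counter` as a stand-in for a bag of words; the helper name `bag_of_words` and the two sentences are made up for illustration.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    return Counter(re.findall(r"\w+", text.lower()))

a = bag_of_words("The dog bites the man")
b = bag_of_words("The man bites the dog")

print(a)        # counts only; no positions are stored
print(a == b)   # True: two different sentences map to the same bag
```

The two sentences mean opposite things, yet their count vectors are identical, which is exactly the information BoW throws away.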

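#### Bag of n-grams : looks at sequences of n tokens, so it can reduce the loss of sentence meaning somewhat. ex) BoW : 1-gram (unigram) /// 2-grams (bigram) : Emma knocked, knocked on => tokens taken two at a time
The computational cost increases, because there are far more distinct n-grams than distinct words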
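A sliding window over the token list is enough to produce n-grams. The sketch below is a plain-Python illustration, not what CountVectorizer runs internally; the function name `ngrams` is hypothetical.

```python
import re

def ngrams(text, n):
    """Return every run of n consecutive tokens, joined with a space."""
    tokens = re.findall(r"\w+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "Emma knocked on the door"
print(ngrams(sentence, 1))  # unigrams
print(ngrams(sentence, 2))  # bigrams: 'emma knocked', 'knocked on', ...
print(ngrams(sentence, 3))  # trigrams
```

A sentence with k tokens yields k - n + 1 n-grams, but across a corpus the number of distinct n-grams grows quickly with n, which is where the extra cost comes from.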

import pandas as pd
import json
from sklearn.feature_extraction.text import CountVectorizer

%matplotlib notebook
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

Example 3-1. Computing n-grams

Dataset = https://www.yelp.com/dataset/download

The business and review datasets are loaded separately

def load_json_df(filename, num_bytes=-1):
    '''Load roughly the first `num_bytes` of `filename` (the whole file if -1), parse each line as JSON, and return the rows as a pandas DataFrame.'''
    # On Windows the default encoding may be cp949 (Korean locale), so set the encoding explicitly.
    with open(filename, encoding='utf-8') as fs:
        df = pd.DataFrame([json.loads(x) for x in fs.readlines(num_bytes)])
    return df

biz_df = load_json_df('business.json')
biz_df.shape

(192609, 14)

biz_df.head()

# Load only the first 10,000 reviews
js = []
with open('review.json', encoding='utf-8') as f:
    for i in range(10000):
        js.append(json.loads(f.readline()))

review_df = pd.DataFrame(js)
review_df.shape

(10000, 9)

review_df.head()

review_df['text'][0]

'Total bill for this horrible service? Over $8Gs. These crooks actually had the nerve to charge us $69 for 3 pills. I checked online the pills can be had for 19 cents EACH! Avoid Hospital ERs at all costs.'

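Create feature transformers for unigrams, bigrams, and trigrams.
The default token pattern ignores single-character words.
In practice this is very useful because it drops meaningless tokens,
but this example includes them explicitly for the sake of explanation.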
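The effect of the two token patterns can be checked with the `re` module alone. In this sketch the first pattern is scikit-learn's documented default for `token_pattern`, and the second is the one passed explicitly in this example; the sample sentence is made up.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"   # scikit-learn default: two or more word characters
EXPLICIT_PATTERN = r"(?u)\b\w+\b"    # pattern used in this example: one or more

text = "I ate a taco"
default_tokens = re.findall(DEFAULT_PATTERN, text)
explicit_tokens = re.findall(EXPLICIT_PATTERN, text)

print(default_tokens)    # single-character words such as 'I' and 'a' are dropped
print(explicit_tokens)   # every word is kept
```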

# bag of words

bow_converter = CountVectorizer(token_pattern='(?u)\\b\\w+\\b')  # see the regular-expression (re) syntax
## You can also write your own function and plug it in as a custom tokenizer: CountVectorizer(tokenizer=myfunc, stop_words=mylist)
x = bow_converter.fit_transform(review_df['text'])
x

<10000x26290 sparse matrix of type '<class 'numpy.int64'>'
	with 723114 stored elements in Compressed Sparse Row format>

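To see the words in the bag of words,

use get_feature_names_out() (get_feature_names() in scikit-learn versions before 1.0)

The analyst can cap the vocabulary size with the max_features parameter

# get_feature_names() was removed in scikit-learn 1.2; .tolist() gives a plain Python list
words = bow_converter.get_feature_names_out().tolist()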
len(words)

26290
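Capping the vocabulary amounts to keeping only the most frequent terms across the corpus, which is how scikit-learn documents `max_features`. The following is a plain-Python sketch of that idea with a made-up three-document corpus; `top_k_vocabulary` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter

docs = [
    "good food good service",
    "good food slow service",
    "food was good",
]

def top_k_vocabulary(docs, k):
    """Return the k most frequent tokens across all documents, sorted alphabetically."""
    totals = Counter(token for doc in docs for token in doc.split())
    return sorted(token for token, _ in totals.most_common(k))

vocab = top_k_vocabulary(docs, 3)
print(vocab)   # the rarer tokens 'slow' and 'was' are left out
```

With the vectorizer itself, the equivalent would be passing `max_features=3` to `CountVectorizer`.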

from random import randint
a = randint(5000, 10000)
print(a)
print(words[a:a+10])

8121
['emma', 'emmener', 'emmy', 'emo', 'emotion', 'emotional', 'emotionally', 'empanada', 'empanadas', 'empathetic']

words[:10]

['0', '00', '000', '00am', '00pm', '01', '01pm', '02', '025', '03']

# bigrams

bigram_converter = CountVectorizer(ngram_range=(2,2), token_pattern='(?u)\\b\\w+\\b')
x2 = bigram_converter.fit_transform(review_df['text'])
x2

<10000x316034 sparse matrix of type '<class 'numpy.int64'>'
	with 1038605 stored elements in Compressed Sparse Row format>

bigrams = bigram_converter.get_feature_names_out().tolist()
len(bigrams)

316034

### The reviews mix several languages, so running the original code prints Japanese(?) text
### Each feature is made up of two words
bigrams[-150:-140]

['zone you', 'zones in', 'zoo and', 'zoo but', 'zoo for', 'zoo is', 'zoo kind', 'zoo test', 'zoo that', 'zoo went']

# trigrams

trigram_converter = CountVectorizer(ngram_range=(3,3), token_pattern='(?u)\\b\\w+\\b')
x3 = trigram_converter.fit_transform(review_df['text'])
x3

<10000x738062 sparse matrix of type '<class 'numpy.int64'>'
	with 1070124 stored elements in Compressed Sparse Row format>

trigrams = trigram_converter.get_feature_names_out().tolist()
len(trigrams)

738062

trigrams[:10]

['0 1 0',
 '0 1 dessert',
 '0 1 good',
 '0 1 it',
 '0 1 prices',
 '0 1 speed',
 '0 10 mbps',
 '0 2 decor',
 '0 2 even',
 '0 2 it']

print (len(words), len(bigrams), len(trigrams))

26290 316034 738062

# Figure 3-6
sns.set_style("darkgrid")
counts = [len(words), len(bigrams), len(trigrams)]
plt.plot(counts, color='cornflowerblue')
plt.plot(counts, 'bo')
plt.margins(0.1)
plt.xticks(range(3), ['unigram', 'bigram', 'trigram'])
plt.tick_params(labelsize=14)
plt.title('Number of ngrams in the first 10,000 reviews of the Yelp dataset', fontsize=14)

Text(0.5,1,'Number of ngrams in the first 10,000 reviews of the Yelp dataset')

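[Filtering for cleaner features]

Stopwords

Stopwords : a list of words to filter out because they carry little meaning

Classification, retrieval : pronouns, articles, and prepositions may add little value
Sentiment analysis : subtle nuances matter, so remove stopwords with care

Words judged unlikely to affect the analysis, such as a, an, and, the, i'll (or, in Korean, 은, 는, 이, 가, 나, 그리고), are stored in a list -> stopwords

English : NLTK provides a stopword list

Korean : no ready-made stopword list really stands out; in my experience it can be better to build your own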
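Stopword filtering itself is a simple set lookup. The sketch below uses a tiny hand-written list purely for illustration; it is not NLTK's list, and a real project would load a fuller one or build its own.

```python
import re

# Hand-written stopword list, for illustration only
STOPWORDS = {"a", "an", "and", "the", "i", "ll", "for", "this"}

def remove_stopwords(text, stopwords=STOPWORDS):
    """Tokenize the text and drop every token found in the stopword set."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stopwords]

filtered = remove_stopwords("Total bill for this horrible service")
print(filtered)
```

For sentiment analysis, words such as "not" should usually stay out of the stopword list, since removing them flips the meaning.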

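Frequency-based filtering

Frequent words

  Stopwords can fall into this group too: words that appear frequently in every document
  -> the analyst reviews these words and merges them into the stopword list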
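One way to surface such candidates is to count in how many documents each word appears (document frequency) and list the words above a threshold. This is a minimal sketch with a made-up corpus; the 50% threshold is an arbitrary assumption.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each token, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df

docs = [
    "the food was great",
    "the service was slow",
    "great food slow service",
    "the place was empty",
]
df = document_frequency(docs)

# Words present in more than half of the documents are stopword candidates
frequent = sorted(w for w, c in df.items() if c / len(docs) > 0.5)
print(frequent)
```

The output is only a list of candidates; the analyst still decides which of them to add to the stopword list.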

Rare words

  Rare words : obscure words or misspellings
            -> they can be counted together in a separate garbage bin
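The garbage-bin idea can be sketched as follows: every token seen fewer than `min_count` times is folded into a single shared feature. The bin name `<RARE>`, the threshold, and the sample tokens are all assumptions made for this example.

```python
from collections import Counter

def bin_rare_words(tokens, min_count=2, bin_name="<RARE>"):
    """Keep counts for common tokens; add the counts of rare tokens into one bin."""
    counts = Counter(tokens)
    binned = Counter()
    for token, c in counts.items():
        binned[token if c >= min_count else bin_name] += c
    return binned

# 'godo' and 'fod' stand in for misspellings
tokens = "good food good service godo fod good food".split()
binned = bin_rare_words(tokens)
print(binned)
```

The vocabulary shrinks from five entries to three, while the total token count is preserved.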

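Stemming

 Tokens with similar meanings but different surface forms should not be counted separately
 Reduce each word to its stem first, then count

Be careful with stemming: it can merge words with very different meanings into a single token! (ex. new, news -> new)

*Note: stemming splits words by rules, while lemmatizing finds the base form using context, i.e. it considers which part of speech the word takes in the sentence.
stemming : flies -> simply reduced to a stem by rule
lemmatizing : flies -> decided from context whether it means 'to fly' or 'a fly' (takes longer)

   Korean -> use a morphological analyzer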
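The rule-based nature of stemming, and the new/news pitfall, can be seen in a toy stemmer. This sketch applies four made-up suffix rules in order; it is not the Porter algorithm or any NLTK stemmer, only an illustration of how rules ignore meaning.

```python
# Toy suffix rules, checked in order; NOT the Porter algorithm
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word, min_stem_len=2):
    """Strip the first matching suffix, provided enough of the word remains."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem + replacement
    return word

for w in ["flowers", "knocked", "flies", "news"]:
    print(w, "->", toy_stem(w))
```

The rule that correctly maps "flowers" to "flower" also maps "news" to "new", because a rule has no notion of what the word means.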

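Units of meaning

Parsing and tokenization

 Parsing : extracting the part you need; a web page holds many kinds of information, so pull out only the required text
 Tokenization : converting a string into a sequence of tokens -> the step that builds the BoW; in English, tokens are split mainly on whitespace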
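Both steps can be sketched with the standard library: `html.parser` pulls the text out of a page (parsing), and `str.split` cuts it on whitespace (tokenization). The HTML snippet and the class name `TextExtractor` are made up for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML page and ignore the markup."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())

html = "<html><body><h1>Great tacos</h1><p>Emma knocked on the door.</p></body></html>"

parser = TextExtractor()
parser.feed(html)
parser.close()

text = " ".join(parser.parts)   # parsing: keep only the text we need
tokens = text.split()           # tokenization: split on whitespace
print(tokens)
```

Splitting on whitespace alone leaves punctuation attached ("door."), which is why vectorizers normally tokenize with a regular expression instead.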

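Collocation extraction

Collocation : an expression of two or more words whose meaning is greater than the sum of its parts

<span style= "color:red"> Ex. strong tea = does not simply mean strong + tea. I am not convinced by this example</span>

cute puppy = cute + puppy -> OK, agreed!

How to extract collocations
1. Define them by hand... -> time-consuming and not realistic
2. Frequency-based : the problem is that the most frequent n-grams are not necessarily collocations
3. Likelihood-ratio test : sort the bigrams in ascending order of likelihood ratio and take the top features (the smallest ratios are the strongest candidates) -> for Korean... this seems practically impossible... => <span style= "color:green"> Looking for someone who can explain the likelihood-ratio test</span>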
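The likelihood-ratio test compares two hypotheses for a bigram (w1, w2): under the null hypothesis w2 is equally likely whether or not w1 comes before it, and under the alternative the two probabilities differ. The sketch below follows the common binomial formulation and returns -2 log(lambda), so a larger score corresponds to a smaller likelihood ratio and a stronger collocation candidate. The function names and the toy token list are assumptions, and the counts are approximated from unigram and bigram totals.

```python
import math
from collections import Counter

def _log_likelihood(k, n, p):
    """Binomial log-likelihood of k successes in n trials; 0 * log(0) is treated as 0."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1 - p)
    return total

def llr(c12, c1, c2, n):
    """Return -2 log(lambda) for a bigram (w1, w2).

    c12: count of the bigram, c1: count of w1, c2: count of w2, n: number of tokens.
    """
    p = c2 / n                    # null: P(w2 | w1) = P(w2 | not w1) = p
    p1 = c12 / c1                 # alternative: P(w2 | w1)
    p2 = (c2 - c12) / (n - c1)    # alternative: P(w2 | not w1)
    log_lambda = (
        _log_likelihood(c12, c1, p) + _log_likelihood(c2 - c12, n - c1, p)
        - _log_likelihood(c12, c1, p1) - _log_likelihood(c2 - c12, n - c1, p2)
    )
    return -2.0 * log_lambda

def score_bigrams(tokens):
    """Score every bigram of two different words; strongest candidates first."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scored = [
        (llr(c, unigrams[w1], unigrams[w2], n), w1 + " " + w2)
        for (w1, w2), c in bigrams.items()
        if w1 != w2   # repeated-word bigrams break the count approximation
    ]
    return sorted(scored, reverse=True)

tokens = "new york is big and new york is loud but the city is new".split()
scores = score_bigrams(tokens)
for score, bigram in scores[:3]:
    print(round(score, 2), bigram)
```

When w2 follows w1 exactly as often as chance predicts, the score is 0; the more the pair sticks together beyond chance, the larger the score. Sorting by lambda itself in ascending order gives the same ranking.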

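Chunking and part-of-speech tagging (PoS tagging)

Chunking : grouping tokens into phrases based on their parts of speech (for Korean, see the morph reference)
PoS tagging : extracting each token together with the part-of-speech tag attached to it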
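Given tokens that already carry PoS tags, noun-phrase chunking can be sketched as a single pass over the (word, tag) pairs. The rule used here (optional determiners, adjectives, or numbers followed by one or more nouns, with Penn Treebank tags) is a simplifying assumption and not how spaCy or TextBlob implement chunking; the tagged sample is written by hand.

```python
MODIFIER_TAGS = {"DT", "JJ", "CD"}   # determiner, adjective, cardinal number

def noun_chunks(tagged):
    """Group modifier* noun+ runs of (word, tag) pairs into noun phrases."""
    chunks, current, has_noun = [], [], False
    for word, tag in tagged:
        if tag.startswith("NN"):
            current.append(word)
            has_noun = True
        elif tag in MODIFIER_TAGS and not has_noun:
            current.append(word)
        else:
            # The running chunk ends here; keep it only if it contains a noun
            if has_noun:
                chunks.append(" ".join(current))
            current = [word] if tag in MODIFIER_TAGS else []
            has_noun = False
    if has_noun:
        chunks.append(" ".join(current))
    return chunks

tagged = [
    ("These", "DT"), ("crooks", "NNS"), ("actually", "RB"), ("had", "VBD"),
    ("the", "DT"), ("nerve", "NN"), ("to", "TO"), ("charge", "VB"),
    ("us", "PRP"), ("$", "$"), ("69", "CD"), ("for", "IN"),
    ("3", "CD"), ("pills", "NNS"),
]
chunks = noun_chunks(tagged)
print(chunks)
```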

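Reference) http://konlpy.org/ko/v0.4.3/morph/

For Korean in particular, the analysis output differs greatly from library to library, so inspect the results very, very carefully, and set up the stopwords carefully too

ps. Personally, I use mecab with extra entries added to its dictionary

Example 3-2. PoS tagging and chunking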

# Load the first 10 reviews
js = []
with open('review.json', encoding='utf-8') as f:
    for i in range(10):
        js.append(json.loads(f.readline()))

review_df = pd.DataFrame(js)
review_df.shape

(10, 9)

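1. Using spaCy : spaCy installation guide

import spacy
# Load the language model.
# The original notebook used the 'en' shortcut (spaCy v2); spaCy v3 removed shortcuts,
# so the full package name is used here.
nlp = spacy.load('en_core_web_sm')
print(spacy.info('en_core_web_sm'))
# model meta data
# If this raises an error, the model data has not been downloaded.
# Don't panic; run this at the prompt:
# python -m spacy download en_core_web_sm
# If that still fails, use the import below.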

import en_core_web_sm
nlp = en_core_web_sm.load()

# Apply to the text column of the DataFrame
doc_df = review_df['text'].apply(nlp)
type(doc_df)

pandas.core.series.Series

type(doc_df[0])

spacy.tokens.doc.Doc

doc_df[0]

Total bill for this horrible service? Over $8Gs. These crooks actually had the nerve to charge us $69 for 3 pills. I checked online the pills can be had for 19 cents EACH! Avoid Hospital ERs at all costs.

# spaCy provides both the coarse part of speech (.pos_) and the fine-grained tag (.tag_)
for token in doc_df[0]:
    print(token.text, token.pos_, token.tag_)

Total ADJ JJ
bill NOUN NN
for ADP IN
this DET DT
horrible ADJ JJ
service NOUN NN
? PUNCT .
Over ADP IN
$ SYM $
8Gs NUM CD
. PUNCT .
These DET DT
crooks NOUN NNS
actually ADV RB
had VERB VBD
the DET DT
nerve NOUN NN
to PART TO
charge VERB VB
us PRON PRP
$ SYM $
69 NUM CD
for ADP IN
3 NUM CD
pills NOUN NNS
. PUNCT .
I PRON PRP
checked VERB VBD
online ADV RB
the DET DT
pills NOUN NNS
can VERB MD
be VERB VB
had VERB VBN
for ADP IN
19 NUM CD
cents NOUN NNS
EACH DET DT
! PUNCT .
Avoid VERB VB
Hospital NOUN NN
ERs NOUN NNS
at ADP IN
all DET DT
costs NOUN NNS
. PUNCT .

# spaCy also provides noun-phrase extraction.
print([chunk for chunk in doc_df[0].noun_chunks])

[Total bill, this horrible service, These crooks, the nerve, us, 3 pills, I, the pills, 19 cents, Avoid Hospital, all costs]

2. Using TextBlob : installation guide

from textblob import TextBlob

# TextBlob uses the PatternTagger by default, which is good enough for this example.
# You can also specify the NLTK tagger, which works better on incomplete sentences.

blob_df = review_df['text'].apply(TextBlob)

type(blob_df)

pandas.core.series.Series

type(blob_df[0])

textblob.blob.TextBlob

blob_df[0].tags

[('Total', 'JJ'),
 ('bill', 'NN'),
 ('for', 'IN'),
 ('this', 'DT'),
 ('horrible', 'JJ'),
 ('service', 'NN'),
 ('Over', 'IN'),
 ('8Gs', 'CD'),
 ('These', 'DT'),
 ('crooks', 'NNS'),
 ('actually', 'RB'),
 ('had', 'VBD'),
 ('the', 'DT'),
 ('nerve', 'NN'),
 ('to', 'TO'),
 ('charge', 'VB'),
 ('us', 'PRP'),
 ('69', 'CD'),
 ('for', 'IN'),
 ('3', 'CD'),
 ('pills', 'NNS'),
 ('I', 'PRP'),
 ('checked', 'VBD'),
 ('online', 'RP'),
 ('the', 'DT'),
 ('pills', 'NNS'),
 ('can', 'MD'),
 ('be', 'VB'),
 ('had', 'VBN'),
 ('for', 'IN'),
 ('19', 'CD'),
 ('cents', 'NNS'),
 ('EACH', 'CD'),
 ('Avoid', 'NNP'),
 ('Hospital', 'NNP'),
 ('ERs', 'NNP'),
 ('at', 'IN'),
 ('all', 'DT'),
 ('costs', 'NNS')]

# TextBlob can also extract noun phrases
print([np for np in blob_df[0].noun_phrases])

['total bill', 'horrible service', '$ 8gs', 'each', 'avoid', 'ers']

[Summary]

Chapter 3 only dips a toe in the water, covering simple feature-generation techniques for text

Output of biz_df.head() (Example 3-1):
	address	attributes	business_id	categories	city	hours	is_open	latitude	longitude	name	postal_code	review_count	stars	state
0	2818 E Camino Acequia Drive	{'GoodForKids': 'False'}	1SWheh84yJXfytovILXOAQ	Golf, Active Life	Phoenix	None	0	33.522143	-112.018481	Arizona Biltmore Golf Club	85016	5	3.0	AZ
1	30 Eglinton Avenue W	{'RestaurantsReservations': 'True', 'GoodForMe...	QXAEGFB4oINsVuTFxEYKFQ	Specialty Food, Restaurants, Dim Sum, Imported...	Mississauga	{'Monday': '9:0-0:0', 'Tuesday': '9:0-0:0', 'W...	1	43.605499	-79.652289	Emerald Chinese Restaurant	L5R 3E7	128	2.5	ON
2	10110 Johnston Rd, Ste 15	{'GoodForKids': 'True', 'NoiseLevel': 'u'avera...	gnKjwL_1w79qoiV3IC_xQQ	Sushi Bars, Restaurants, Japanese	Charlotte	{'Monday': '17:30-21:30', 'Wednesday': '17:30-...	1	35.092564	-80.859132	Musashi Japanese Restaurant	28210	170	4.0	NC
3	15655 W Roosevelt St, Ste 237	None	xvX2CttrVhyG2z1dFg_0xw	Insurance, Financial Services	Goodyear	{'Monday': '8:0-17:0', 'Tuesday': '8:0-17:0', ...	1	33.455613	-112.395596	Farmers Insurance - Paul Lorenz	85338	3	5.0	AZ
4	4209 Stuart Andrew Blvd, Ste F	{'BusinessAcceptsBitcoin': 'False', 'ByAppoint...	HhyxOkGAM07SRYtlQ4wMFQ	Plumbing, Shopping, Local Services, Home Servi...	Charlotte	{'Monday': '7:0-23:0', 'Tuesday': '7:0-23:0', ...	1	35.190012	-80.887223	Queen City Plumbing	28217	4	4.0	NC

Output of review_df.head() (Example 3-1):
	business_id	date	funny	review_id	stars	text	useful	user_id
0	ujmEBvifdJM6h6RLv4wQIg	2013-05-07 04:34:36	1	Q1sbwvVQXV2734tPgoKj4Q	1.0	Total bill for this horrible service? Over $8G...	6	hG7b0MtEbXx5QzbzE6C_VA
1	NZnhc2sEQy3RmzKTZnqtwQ	2017-01-14 21:30:33	0	GJXCdrto3ASJOqKeVWPi6Q	5.0	I adore Travis at the Hard Rock's new Kelly ...	0	yXQM5uF2jS6es16SJzNHfg
2	WTqjgwHlXbSFevF32_DJVw	2016-11-09 20:09:03	0	2TzJjDVDEuAW6MR5Vuc1ug	5.0	I have to say that this office really has it t...	3	n6-Gk65cPZL6Uz8qRm3NYw
3	ikCg8xy5JIg_NGPx-MSIDA	2018-01-09 20:56:38	0	yi0R0Ugj_xUx_Nek0-_Qig	5.0	Went in for a lunch. Steak sandwich was delici...	0	dacAIZ6fTM6mqwW5uxkskg
4	b1b1eb3uo-w561D0ZfCEiQ	2018-01-30 23:07:38	0	11a8sVPMUFtaC7_ABRkmtw	1.0	Today was my second out of three sessions I ha...	7	ssoyf2_x0EQMed6fgHeMyQ


데이터과학 삼학년


Ch3. Text Data: Flattening, Filtering,and Chunking
