[Text preprocessing] Detecting the encoding of text data

데이터과학 삼학년

Natural Language Processing

Dan-k 2020. 5. 28. 16:59

The cchardet library is commonly used to check the encoding of a file before reading it.

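import cchardet

def encoding_type(file_path):
    # Read the raw bytes; detection has to run on undecoded data.
    with open(file_path, 'rb') as f:
        data = f.read()
    # detect() returns a dict; the 'encoding' key holds the detected name.
    encoding = cchardet.detect(data)['encoding']
    print('encoding type:', encoding)
    return encoding


=======
UTF-8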
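
The detection result also carries a confidence value, and the encoding may come back as None when the detector cannot decide (for example on empty input). As a minimal sketch, the hypothetical helper pick_encoding below (not part of cchardet) takes a dict shaped like the return value of cchardet.detect and falls back to a default. The 0.5 threshold and the 'utf-8' default are assumptions chosen for illustration, not recommended values.

```python
def pick_encoding(result, default='utf-8', min_confidence=0.5):
    """Return a usable encoding name from a detect()-style result dict.

    `result` is assumed to look like {'encoding': ..., 'confidence': ...}.
    """
    encoding = result.get('encoding')
    confidence = result.get('confidence') or 0.0
    # Fall back when detection failed or the detector is not confident.
    if encoding is None or confidence < min_confidence:
        return default
    return encoding


# Hand-written dicts stand in for real detection results here.
confident = pick_encoding({'encoding': 'UHC', 'confidence': 0.99})
undecided = pick_encoding({'encoding': None, 'confidence': None})
print(confident, undecided)
```

In real use the dict would come from cchardet.detect(data), and the returned name would be passed to open(..., encoding=...).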

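You need to find out the encoding of the text data you are about to read, and then read the data using that encoding.
This is not trivial for Korean text: UTF-8 works in most cases, but some files turn out to be encoded as utf-1 or UHC (CP949) instead,
and reading them with the wrong encoding either fails or produces garbled text.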
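
To make the problem concrete, the sketch below uses only the standard library: it encodes a Korean string as CP949 and then tries a list of candidate encodings in order. This is a simple fallback strategy, not what cchardet does internally, and the function name decode_with_candidates and the candidate list are assumptions for illustration. UTF-8 is tried first because it is strict and rejects most non-UTF-8 byte sequences.

```python
text = '한글 텍스트'
raw = text.encode('cp949')  # bytes as a legacy Korean file might store them


def decode_with_candidates(data, candidates=('utf-8', 'cp949')):
    """Try each candidate encoding in order and return (text, encoding)."""
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the bytes, try the next
    raise ValueError('none of the candidate encodings could decode the data')


decoded, used = decode_with_candidates(raw)
print(used, decoded)
```

A statistical detector such as cchardet is more general, since it does not require the candidate list to be known in advance.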

If you want to read every file in a folder and collect the sentences that match a given pattern,
you can write functions like the ones below.

import os
import re

import cchardet
import kss


def encoding_type(file_path):
    # Detect the encoding from the raw bytes of the file.
    with open(file_path, 'rb') as f:
        data = f.read()
    encoding = cchardet.detect(data)['encoding']
    return encoding


def extract_sentence_include_pattern(file_path, pattern):
    # Detect the encoding first, then reopen the file as text with it.
    encoding = encoding_type(file_path)
    contents = []
    with open(file_path, 'r', encoding=encoding) as hand:
        for line in hand:
            line = line.rstrip()
            # Split the line into sentences with kss.
            paragraph = kss.split_sentences(line)
            for i in paragraph:
                if re.search(pattern, i):
                    contents.append(i)
    data = ','.join(contents)
    return data


def preprocessing_paragraph_filter(file_list_path, pattern):
    file_list = os.listdir(file_list_path)  # entries found in the given folder
    sentences_lst = []
    for i in file_list:
        file_path = os.path.join(file_list_path, i)
        sentences = extract_sentence_include_pattern(file_path, pattern)
        sentences_lst.append(sentences)
    return sentences_lst
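
One detail of the folder loop is worth a note: os.listdir also returns the names of subdirectories, and trying to open one of those as a file raises an error. The sketch below shows one possible way to collect only regular files. The helper name list_regular_files and the temporary folder layout are hypothetical and exist only for this demonstration.

```python
import os
import tempfile


def list_regular_files(folder):
    """Return sorted full paths of the regular files directly inside folder."""
    paths = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):  # skip subdirectories and other non-files
            paths.append(path)
    return paths


# Build a throwaway folder with two files and one subdirectory.
with tempfile.TemporaryDirectory() as folder:
    os.mkdir(os.path.join(folder, 'sub'))
    for name in ('b.txt', 'a.txt'):
        with open(os.path.join(folder, name), 'w', encoding='utf-8') as f:
            f.write('sample')
    names = [os.path.basename(p) for p in list_regular_files(folder)]

print(names)
```

The paths returned by such a helper could be fed to extract_sentence_include_pattern in place of the raw os.listdir result.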

Source: https://github.com/PyYoshi/cChardet
