[Text preprocessing] Detecting the encoding of text data

Dan-k 2020. 5. 28. 16:59

To check the encoding of a file before reading it, the cchardet library is commonly used.

import cchardet

def encoding_type(file_path):
    # Read raw bytes; detection works on bytes, not on decoded text
    with open(file_path, 'rb') as f:
        data = f.read()
    encoding = cchardet.detect(data)['encoding']
    print('encoding type:', encoding)
    return encoding
    
    
=======
UTF-8
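
The detected encoding is a guess, so it can help to guard against an empty or low-confidence result before using it. Below is a minimal sketch, assuming the result is a dict with `encoding` and `confidence` keys, which is the shape `cchardet.detect` returns. The helper name `pick_encoding`, the `utf-8` default, and the 0.5 threshold are illustrative choices, not part of cchardet.

```python
def pick_encoding(result, default="utf-8", min_confidence=0.5):
    """Return a usable encoding name from a detect()-style result dict."""
    encoding = result.get("encoding")
    # Treat a missing or None confidence as zero
    confidence = result.get("confidence") or 0.0
    if not encoding or confidence < min_confidence:
        return default
    return encoding


# Hand-written dicts shaped like detection results, for illustration only
print(pick_encoding({"encoding": "UTF-8", "confidence": 0.99}))  # UTF-8
print(pick_encoding({"encoding": None, "confidence": None}))     # utf-8
```

In practice the argument would be the dict returned by `cchardet.detect(data)`.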

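The encoding of the text has to be identified first so that the data can be read with the matching encoding.
Korean text usually works fine when it is encoded as UTF-8, but files are sometimes encoded in other formats such as UTF-16 or UHC (CP949),
so reading them is not always straightforward.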
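
To see why the encoding matters, the following small sketch encodes the same Korean string in two ways and tries to decode the result. It uses only the standard library. `try_decode` is a hypothetical helper written for this illustration, and `cp949` is Python's codec name for UHC.

```python
text = "한글 텍스트"

# Encode the same string two ways; the resulting byte sequences differ
utf8_bytes = text.encode("utf-8")
cp949_bytes = text.encode("cp949")


def try_decode(data, encoding):
    """Return the decoded string, or None if the bytes do not fit the encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


print(try_decode(cp949_bytes, "utf-8"))  # None: the bytes are not valid UTF-8
print(try_decode(cp949_bytes, "cp949"))  # 한글 텍스트
```

Reading a UHC file as UTF-8 fails in the same way, which is why the encoding is detected before the file is opened in text mode.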

To read every file in a folder and collect the sentences that match a given pattern,
you can write functions like the following.

import os
import re

import cchardet
import kss


def encoding_type(file_path):
    # Detect the encoding from the file's raw bytes
    with open(file_path, 'rb') as f:
        data = f.read()
    return cchardet.detect(data)['encoding']


def extract_sentence_include_pattern(file_path, pattern):
    encoding = encoding_type(file_path)
    contents = []
    with open(file_path, 'r', encoding=encoding) as hand:
        for line in hand:
            # Split each line into sentences and keep those matching the pattern
            for sentence in kss.split_sentences(line.rstrip()):
                if re.search(pattern, sentence):
                    contents.append(sentence)
    return ','.join(contents)


def preprocessing_paragraph_filter(file_list_path, pattern):
    file_list = os.listdir(file_list_path)  # files in the given folder
    sentences_lst = []
    for file_name in file_list:
        file_path = os.path.join(file_list_path, file_name)
        sentences = extract_sentence_include_pattern(file_path, pattern)
        sentences_lst.append(sentences)
    return sentences_lst
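
To try the folder-level idea without installing cchardet or kss, the sketch below builds a temporary folder with files in two encodings and collects the lines that match a pattern. It is a simplified variant under stated assumptions, not the approach above: detection is swapped for trying candidate encodings in order, and whole lines are matched instead of kss sentences. `read_text_with_fallback`, `collect_matching_lines`, and the candidate list are hypothetical names chosen for illustration.

```python
import os
import re
import tempfile


def read_text_with_fallback(file_path, candidates=("utf-8", "cp949")):
    """Decode a file by trying each candidate encoding in order."""
    with open(file_path, "rb") as f:
        data = f.read()
    for encoding in candidates:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {candidates} could decode {file_path}")


def collect_matching_lines(folder_path, pattern):
    """Return the lines matching `pattern` from every file in the folder."""
    matches = []
    for name in sorted(os.listdir(folder_path)):
        text = read_text_with_fallback(os.path.join(folder_path, name))
        for line in text.splitlines():
            if re.search(pattern, line):
                matches.append(line.strip())
    return matches


# Build a throwaway folder with one UTF-8 file and one CP949 (UHC) file
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "a.txt"), "wb") as f:
        f.write("첫 번째 문장입니다.\n가격은 1000원입니다.\n".encode("utf-8"))
    with open(os.path.join(folder, "b.txt"), "wb") as f:
        f.write("가격은 2000원입니다.\n".encode("cp949"))
    result = collect_matching_lines(folder, r"\d+원")

print(result)  # ['가격은 1000원입니다.', '가격은 2000원입니다.']
```

UTF-8 is tried first because it rejects most non-UTF-8 byte sequences, whereas a permissive legacy codec may decode the wrong bytes without raising an error. A detector such as cchardet is still the better choice when the set of possible encodings is not known in advance.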

Source: https://github.com/PyYoshi/cChardet
