Ch.9 Back to the Feature: Building an Academic Paper Recommender

Dan-k 2020. 6. 3. 14:45
Chapter 9

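Building an Academic Paper Recommender System

  • Useful for people who want to search for citations but do not yet know about Google Scholar
  • Uses the Microsoft Academic Graph dataset from the Open Academic Graph
    • Number of papers: 166,192,182
    • Dataset size: 104 GB
    • Number of columns: 18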

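Item-Based Collaborative Filtering

  • First developed at Amazon to improve on user-based algorithms for product recommendation
  • Provides recommendations based on the similarity between items
    1. Generalize the information about each item into features
    2. Compute similarity scores between items
    3. Rank by score and recommend the top $N$ most similar items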
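
The three steps above can be sketched on toy data. This is a minimal illustration, not the chapter's code: the paper names, the four-element feature vectors, and the helper `top_n_similar` are all made up for the example.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)

def top_n_similar(item_features, query, n):
    """Steps 2 and 3: score every other item, then return the n best."""
    scores = {
        name: cosine_similarity(item_features[query], vec)
        for name, vec in item_features.items()
        if name != query
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Step 1: each item is described by a (hypothetical) binary feature vector.
item_features = {
    "paper_a": [1, 1, 0, 0],
    "paper_b": [1, 1, 1, 0],
    "paper_c": [0, 0, 1, 1],
    "paper_d": [0, 0, 0, 1],
}

recommendations = top_n_similar(item_features, "paper_a", 2)
print(recommendations)
```

Here `paper_b` shares two features with `paper_a` and ranks first; the other two papers share none and score 0.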

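First Pass: Data Import, Cleaning, and Feature Parsing

  • Hypothesis: papers published around the same time in similar fields of study will be the most useful to the user
    • Take a simple approach: parse the relevant fields from a subsample of the full dataset
  • Similarity: cosine similarity
    • Use its complement, the cosine distance
    • $D_C(A,B)=1 - S_C(A,B)$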
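
For reference, $S_C$ is the usual normalized dot product. The definition and a small worked case follow; the vectors are chosen only for illustration.

$$S_C(A,B)=\frac{A\cdot B}{\|A\|_2\,\|B\|_2}$$

For $A=(1,1,0)$ and $B=(0,1,0)$: $A\cdot B=1$, $\|A\|_2=\sqrt{2}$, $\|B\|_2=1$, so

$$S_C(A,B)=\frac{1}{\sqrt{2}}\approx 0.7071,\qquad D_C(A,B)=1-\frac{1}{\sqrt{2}}\approx 0.2929$$

Two vectors with no shared non-zero component have $S_C=0$ and therefore $D_C=1$.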

Academic Paper Recommender: Naive Approach

  • Inspect the dataset and decide which fields to use

Example 9-1. Data import and filtering

In [0]:
import pandas as pd
In [0]:
model_df = pd.read_json('mag_papers_0.txt', lines=True) # 1 million records
model_df.shape
Out[0]:
(1000000, 19)
In [0]:
# Use only the first 20,000 records in this example
df20000 = model_df.iloc[:20000,:]
df20000.shape
Out[0]:
(20000, 19)
In [0]:
df20000.to_json('mag_subset20K.txt', orient='records', lines=True)
In [0]:
model_df = pd.read_json('mag_subset20K.txt', lines=True)
model_df.shape
Out[0]:
(20000, 19)
In [0]:
model_df.columns
Out[0]:
Index(['abstract', 'authors', 'doc_type', 'doi', 'fos', 'id', 'issue',
       'keywords', 'lang', 'n_citation', 'page_end', 'page_start', 'publisher',
       'references', 'title', 'url', 'venue', 'volume', 'year'],
      dtype='object')
In [0]:
# Exclude papers that are not in English.
# Drop papers with duplicate titles.
model_df = model_df[model_df.lang == 'en'].drop_duplicates(subset='title', keep='first')

# Keep only the abstract, authors, fos, keywords, year, and title columns
model_df = model_df.drop(['doc_type', 
                          'doi', 'id', 
                          'issue', 'lang', 
                          'n_citation', 
                          'page_end', 
                          'page_start', 
                          'publisher', 
                          'references',
                          'url', 
                          'venue', 
                          'volume'], axis=1)
In [0]:
# About 10,000 papers remain after filtering.
model_df.shape
Out[0]:
(10399, 6)
In [0]:
model_df.head(2)
Out[0]:
abstract authors fos keywords title year
0 A system and method for maskless direct write ... None [Electronic engineering, Computer hardware, En... None System and Method for Maskless Direct Write Li... 2015
1 None [{'name': 'Ahmed M. Alluwaimi'}] [Biology, Virology, Immunology, Microbiology] [paratuberculosis, of, subspecies, proceedings... The dilemma of the Mycobacterium avium subspec... 2016
In [0]:
for col in model_df.columns:
    print('{}: {}'.format(col, model_df[col].isnull().sum()))
abstract: 4393
authors: 1
fos: 1733
keywords: 4294
title: 0
year: 0
Field     Description                      Type                              # NaN
abstract  Paper abstract                   string                            4393
authors   Author names and organizations   list of dict, keys = name, org    1
fos       Fields of study                  list of strings                   1733
keywords  Keywords                         list of strings                   4294
title     Paper title                      string                            0
year      Publication year                 int                               0

Example 9-2. Collaborative filtering stage 1: Build the item feature matrix

In [0]:
# Fields of study
unique_fos = sorted(list({feature
                          for paper_row in model_df.fos.fillna('0')
                          for feature in paper_row }))

# Publication year
unique_year = sorted(model_df['year'].astype('str').unique())

print('unique_fos  :', len(unique_fos))
print('unique_year :', len(unique_year))
print('total       :', len(unique_fos + unique_year))
unique_fos  : 7604
unique_year : 156
total       : 7760
In [0]:
def feature_array(x, unique_array):
    row_dict = {}
    for i in x.index:
        var_dict = {}
        
        for j in range(len(unique_array)):
            if type(x[i]) is list:
                if unique_array[j] in x[i]:
                    var_dict.update({unique_array[j]: 1})
                else:
                    var_dict.update({unique_array[j]: 0})
            else:
                if unique_array[j] == str(x[i]):
                    var_dict.update({unique_array[j]: 1})
                else:
                    var_dict.update({unique_array[j]: 0})
        
        row_dict.update({i : var_dict})
    
    feature_df = pd.DataFrame.from_dict(row_dict, dtype='str').T
    
    return feature_df
In [0]:
year_features = feature_array(model_df['year'], unique_year)
year_features.shape
Out[0]:
(10399, 156)
In [0]:
year_features.head()
Out[0]:
1831 1832 1833 1834 1836 1837 1840 1841 1845 1847 ... 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
5 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 156 columns

In [0]:
fos_features = feature_array(model_df['fos'], unique_fos)
fos_features.shape
Out[0]:
(10399, 7604)
In [0]:
fos_features.head()
Out[0]:
0 0-10 V lighting control 1/N expansion 10G-PON 14-3-3 protein 2-choice hashing 20th-century philosophy 2D computer graphics 2DEG 3-D Secure ... k-nearest neighbors algorithm m-derived filter microRNA pH photoperiodism route strictfp string Ćuk converter μ operator
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 7604 columns

In [0]:
first_features = fos_features.join(year_features).T
first_features.shape
Out[0]:
(7760, 10399)
In [0]:
first_features.head()
Out[0]:
0 1 2 5 7 8 9 10 11 12 ... 19985 19986 19987 19988 19993 19994 19995 19997 19998 19999
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
0-10 V lighting control 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1/N expansion 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
10G-PON 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
14-3-3 protein 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 10399 columns

In [0]:
from sys import getsizeof

print('original array: {}'.format(getsizeof(model_df)))
print('first feature array: {}'.format(getsizeof(first_features)))
original array: 13892361
first feature array: 5003741538
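
The roughly 5 GB figure comes from storing a value for every (feature, paper) pair even though almost all of them are zero. One way to avoid this, sketched below under stated assumptions, is to keep only the names of the active features for each paper. For binary vectors the cosine similarity then reduces to the size of the set intersection divided by the square root of the product of the set sizes. The helpers `active_features` and `binary_cosine` are hypothetical and not part of the chapter's code.

```python
import math

def active_features(fos, year):
    """Return the set of active binary features for one paper."""
    features = {"fos=" + field for field in (fos or [])}
    features.add("year=" + str(year))
    return features

def binary_cosine(a, b):
    """Cosine similarity of two binary vectors given as sets of active features."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

paper_1 = active_features(["Biology", "Virology"], 2016)
paper_2 = active_features(["Biology", "Immunology"], 2016)
paper_3 = active_features(None, 1955)

sim_12 = binary_cosine(paper_1, paper_2)  # two shared features, three each
sim_13 = binary_cosine(paper_1, paper_3)  # no shared features
print(sim_12, sim_13)
```

Memory use then grows with the number of active features per paper rather than with the 7,760 possible features.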

Example 9-3. Collaborative filtering stage 2: Search for similar items

In [0]:
from scipy.spatial.distance import cosine
In [0]:
def item_collab_filter(features_df):
    item_similarities = pd.DataFrame(index=features_df.columns, columns=features_df.columns)
    
    for i in features_df.columns:
        for j in features_df.columns:
            item_similarities.loc[i, j] = 1 - cosine(features_df[i].astype('float'), features_df[j].astype('float'))
    
    return item_similarities
In [0]:
%time first_items = item_collab_filter(first_features.loc[:, 0:1000])
Wall time: 12min 5s

Example 9-4. Heatmap of the paper recommender's similarity scores

In [0]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
In [0]:
sns.set()
ax = sns.heatmap(first_items.fillna(0), 
                 vmin=0, vmax=1, 
                 cmap="YlGnBu", 
                 xticklabels=250, yticklabels=250)
ax.tick_params(labelsize=12)
  • Darker colors indicate more similar items
  • Most items are not similar to one another

Example 9-5. Item-based collaborative filtering recommendations

In [0]:
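def paper_recommender(paper_index, items_df):
    print('Based on the paper: \nindex = ', paper_index)
    # Note: iloc is positional, while items_df is keyed by DataFrame labels.
    # After rows were dropped the two differ (see Example 9-11), so a record
    # printed via iloc can belong to a different paper than the reported index.
    print(model_df.iloc[paper_index])
    top_results = items_df.loc[paper_index].sort_values(ascending=False).head(4)
    print('\nTop three results: ') 
    order = 1
    for i in top_results.index.tolist()[-3:]:
        print(order,'. Paper index = ', i)
        print('Similarity score: ', top_results[i])
        print(model_df.iloc[i], '\n')
        if order < 5: 
            order += 1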
In [0]:
paper_recommender(2, first_items)
Based on the paper: 
index =  2
abstract                                                 None
authors     [{'name': 'Jovana P. Lekovich', 'org': 'Weill ...
fos                                                      None
keywords                                                 None
title       Should endometriosis be an indication for intr...
year                                                     2015
Name: 2, dtype: object

Top three results: 
1 . Paper index =  2
Similarity score:  1.0
abstract                                                 None
authors     [{'name': 'Jovana P. Lekovich', 'org': 'Weill ...
fos                                                      None
keywords                                                 None
title       Should endometriosis be an indication for intr...
year                                                     2015
Name: 2, dtype: object 

2 . Paper index =  292
Similarity score:  1.0
abstract                                                 None
authors     [{'name': 'John C. Newton'}, {'name': 'Beers M...
fos         [Wide area multilateration, Maneuvering speed,...
keywords                                                 None
title                    Automatic speed control for aircraft
year                                                     1955
Name: 561, dtype: object 

3 . Paper index =  593
Similarity score:  1.0
abstract    This paper demonstrates that on‐site greywater...
authors     [{'name': 'Eran Friedler', 'org': 'Division of...
fos         [Public opinion, Environmental Engineering, Wa...
keywords    [economic analysis, tratamiento desperdicios, ...
title       The water saving potential and the socio-econo...
year                                                     2008
Name: 1152, dtype: object 

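  • The paper itself comes out as the most similar item
  • However, the other two papers are reported as similar even though they are not
  • The current approach is too slow for iterative engineering
    • A different approach is needed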
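
Much of the slowness comes from calling `cosine` once per pair inside two Python loops. A common alternative, sketched here with NumPy, normalizes every row once and gets all pairwise similarities from a single matrix product. The function name and the toy matrix are assumptions for illustration. Note also that the sketch expects one item per row, whereas `first_features` holds one item per column.

```python
import numpy as np

def cosine_similarity_matrix(features):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    norms[norms == 0] = 1.0        # keep all-zero rows at zero similarity
    unit = features / norms        # scale every row to unit length
    return unit @ unit.T           # dot products of unit vectors are cosines

toy = np.array([[1, 1, 0],
                [0, 1, 0],
                [0, 0, 1]])
sims = cosine_similarity_matrix(toy)
print(sims.round(4))
```

The result is an items-by-items array, so row `i` can be sorted directly to rank the neighbors of item `i`.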

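Second Pass: Feature Engineering and a Smarter Model

  • Apply better techniques to the two features and modify the item-based collaborative filter so that iteration is faster

Academic Paper Recommender: Take 2

  • Using raw counts as features is problematic for methods that rely on a similarity measure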
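
A toy case shows how a large raw number can swamp a similarity measure. The vectors below are invented for illustration and do not come from the dataset: each holds a raw year plus one binary flag, and is then compared with a binned, dummy-coded version of the same information.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Raw encoding: [year, field flag]. The year dominates both vectors.
raw_a = [2015, 1]
raw_b = [1955, 0]
raw_similarity = cosine_similarity(raw_a, raw_b)

# Dummy-coded encoding: [is_1950s, is_2010s, field flag].
coded_a = [0, 1, 1]
coded_b = [1, 0, 0]
coded_similarity = cosine_similarity(coded_a, coded_b)

print(raw_similarity, coded_similarity)
```

With the raw year the two papers look almost identical although they are 60 years apart and share no field. With dummy coding they share nothing and score 0.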

Example 9-6. Fixed-width binning + dummy coding (part 1)

In [0]:
model_df['year'].tail()
Out[0]:
19994    1951
19995    2017
19997    1971
19998    1986
19999    2015
Name: year, dtype: int64
In [0]:
print("Year spread: ", model_df['year'].min()," - ", model_df['year'].max())
print("Quantile spread:\n", model_df['year'].quantile([0.25, 0.5, 0.75]))
Year spread:  1831  -  2017
Quantile spread:
 0.25    1990.0
0.50    2005.0
0.75    2012.0
Name: year, dtype: float64
In [0]:
# Check the distribution of year
sns.set_style('whitegrid')
fig, ax = plt.subplots(figsize=(7, 5))
model_df['year'].hist(ax=ax, bins= model_df['year'].max() - model_df['year'].min())
ax.tick_params(labelsize=12)
ax.set_xlabel('Year Count', fontsize=12)
ax.set_ylabel('Occurrence', fontsize=12)
Out[0]:
Text(0, 0.5, 'Occurrence')

Example 9-7. Fixed-width binning + dummy coding (part 2)

In [0]:
# Bins are set by the range of the variable, not by the number of data points.
model_df['year'].max() - model_df['year'].min()
Out[0]:
186
In [0]:
# Bin the year feature into roughly 10-year intervals
bins = int(round((model_df['year'].max() - model_df['year'].min()) / 10))

temp_df = pd.DataFrame(index=model_df.index)
temp_df['yearBinned'] = pd.cut(model_df['year'].tolist(), bins, precision=0)

# Binning year by decade shrinks the feature space from 156 to 19.
print('We have reduced from', len(model_df['year'].unique()),
      'to', len(temp_df['yearBinned'].values.unique()), 'features representing the year.')
We have reduced from 156 to 19 features representing the year.
In [0]:
X_yrs = pd.get_dummies(temp_df['yearBinned'])
X_yrs.head()
Out[0]:
(1831.0, 1841.0] (1841.0, 1851.0] (1851.0, 1860.0] (1860.0, 1870.0] (1870.0, 1880.0] (1880.0, 1890.0] (1890.0, 1900.0] (1900.0, 1909.0] (1909.0, 1919.0] (1919.0, 1929.0] (1929.0, 1939.0] (1939.0, 1948.0] (1948.0, 1958.0] (1958.0, 1968.0] (1968.0, 1978.0] (1978.0, 1988.0] (1988.0, 1997.0] (1997.0, 2007.0] (2007.0, 2017.0]
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
5 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
In [0]:
# Check the distribution of the binned years
sns.set_style('white')
fig, ax = plt.subplots(figsize=(7, 5))
X_yrs.sum().plot.bar(ax = ax)
ax.tick_params(labelsize=12)
ax.set_xlabel('Binned Years', fontsize=12)
ax.set_ylabel('Counts', fontsize=12)
Out[0]:
Text(0, 0.5, 'Counts')
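
What fixed-width binning followed by dummy coding does can also be written out by hand. The sketch below uses the year range from the data (1831 to 2017) and 19 bins. It is a simplified stand-in with hypothetical helper names: `pd.cut` uses right-closed intervals, so values that fall exactly on a bin edge may land in a different bin than they do here.

```python
def fixed_width_bin(value, low, high, n_bins):
    """Return the 0-based index of the equal-width bin containing value."""
    width = (high - low) / n_bins
    index = int((value - low) / width)
    return min(index, n_bins - 1)  # place the maximum value in the last bin

def one_hot(index, n_bins):
    """Dummy-code a bin index as a list with a single 1."""
    return [1 if i == index else 0 for i in range(n_bins)]

low, high, n_bins = 1831, 2017, 19

bin_2015 = fixed_width_bin(2015, low, high, n_bins)
bin_1968 = fixed_width_bin(1968, low, high, n_bins)
row_2015 = one_hot(bin_2015, n_bins)

print(bin_2015, bin_1968)
print(row_2015)
```

Each bin spans 186 / 19, or about 9.8 years, which is why the interval labels in the output above are not exact decades.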

Example 9-8. Converting the bag-of-phrases pandas Series to a NumPy sparse array

In [0]:
X_fos = fos_features.values

# Comparing the size of each object hints at the difference this will make later.
print('Our pandas Series, in bytes: ', getsizeof(fos_features))
print('Our hashed numpy array, in bytes: ', getsizeof(X_fos))
Our pandas Series, in bytes:  4902998648
Our hashed numpy array, in bytes:  112
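
The 112-byte figure should be read with care. `sys.getsizeof` on a NumPy array that does not own its data counts only the array object itself, and `DataFrame.values` likely returned such a view here, so the underlying buffer is not included. The array is also still dense, not sparse or hashed. The sketch below shows the effect on a made-up array, with `nbytes` used to report the buffer size.

```python
from sys import getsizeof
import numpy as np

owner = np.zeros((1000, 1000))  # owns an 8,000,000-byte float64 buffer
view = owner[:]                 # a view that shares the same buffer

header_only = getsizeof(view)   # counts the array object, not the shared buffer
buffer_bytes = view.nbytes      # rows * columns * itemsize

print('getsizeof(view):', header_only)
print('view.nbytes    :', buffer_bytes)
print('owns its data? :', view.flags.owndata)
```

For size comparisons between feature arrays, `nbytes` is usually the more meaningful number.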

Example 9-9. Collaborative filtering stages 1 + 2: Build the item feature matrix, search for similar items

In [0]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import DictVectorizer
In [0]:
X_yrs.shape[1] + X_fos.shape[1]
Out[0]:
7623
In [0]:
# 10399 x 7623 array
%time second_features = np.append(X_fos, X_yrs, axis = 1)

second_size = getsizeof(second_features)
print('Size of second feature array, in bytes: ', second_size)
Wall time: 308 ms
Size of second feature array, in bytes:  634172728
In [0]:
print("The power of feature engineering saves us, in bytes: ", getsizeof(fos_features) - second_size)
The power of feature engineering saves us, in bytes:  4268825920
In [0]:
def piped_collab_filter(features_matrix, index, top_n):
    # Note: 1 - cosine_similarity is the cosine distance, so the descending sort
    # below puts the most distant items first and the reported score is a distance.
    item_similarities = 1 - cosine_similarity(features_matrix[index:index+1], features_matrix).flatten() 
    related_indices = [i for i in item_similarities.argsort()[::-1] if i != index]

    return [(index, item_similarities[index]) for index in related_indices][0:top_n]
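
In `piped_collab_filter`, the expression `1 - cosine_similarity(...)` is a cosine distance, so sorting it in descending order ranks the least similar papers first. The variant below ranks by similarity directly. It is a sketch under assumptions, not the author's code: the function name `top_n_by_similarity` and the toy matrix are hypothetical, and it uses plain NumPy instead of scikit-learn.

```python
import numpy as np

def top_n_by_similarity(features_matrix, index, top_n):
    """Return (row position, cosine similarity) for the top_n most similar rows."""
    features = np.asarray(features_matrix, dtype=float)
    norms = np.linalg.norm(features, axis=1)
    norms[norms == 0] = 1.0  # all-zero rows get similarity 0 instead of NaN
    similarities = (features @ features[index]) / (norms * norms[index])
    # Sort from most to least similar and skip the query row itself.
    ranked = [i for i in np.argsort(-similarities, kind="stable") if i != index]
    return [(int(i), float(similarities[i])) for i in ranked[:top_n]]

toy = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 0, 1, 1],
                [1, 0, 0, 0]])
results = top_n_by_similarity(toy, 0, 2)
print(results)
```

With this ordering, a score of 1.0 means an identical feature vector and 0.0 means no overlap.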

Example 9-10. Item-based collaborative filtering recommender: Take 2

In [0]:
def paper_recommender(items_df, paper_index, top_n):
    if paper_index in model_df.index:
        
        print('Based on the paper:')
        print('Paper index = ', model_df.loc[paper_index].name)
        print('Title :', model_df.loc[paper_index]['title'])
        print('FOS :', model_df.loc[paper_index]['fos'])
        print('Year :', model_df.loc[paper_index]['year'])
        print('Abstract :', model_df.loc[paper_index]['abstract'])
        print('Authors :', model_df.loc[paper_index]['authors'], '\n')
        
        # Positional index that corresponds to the requested DataFrame index label
        array_ix = model_df.index.get_loc(paper_index)

        top_results = piped_collab_filter(items_df, array_ix, top_n)
        
        print('\nTop',top_n,'results: ') 
        
        order = 1
        for i in range(len(top_results)):
            print(order,'. Paper index = ', model_df.iloc[top_results[i][0]].name)
            print('Similarity score: ', top_results[i][1])
            print('Title :', model_df.iloc[top_results[i][0]]['title'])
            print('FOS :', model_df.iloc[top_results[i][0]]['fos'])
            print('Year :', model_df.iloc[top_results[i][0]]['year'])
            print('Abstract :', model_df.iloc[top_results[i][0]]['abstract'])
            print('Authors :', model_df.iloc[top_results[i][0]]['authors'], '\n')
            if order < top_n: order += 1
    
    else:
        print('Whoops! Choose another paper. Try something from here: \n', model_df.index[100:200])
In [0]:
paper_recommender(second_features, 2, 3)
Based on the paper:
Paper index =  2
Title : Should endometriosis be an indication for intracytoplasmic sperm injection (ICSI) in fresh IVF cycles
FOS : None
Year : 2015
Abstract : None
Authors : [{'name': 'Jovana P. Lekovich', 'org': 'Weill Cornell Medical College, New York, NY'}, {'name': 'G.D. Palermo', 'org': 'Weill Medical College of Cornell University, New York, NY'}, {'name': 'Nigel Pereira', 'org': 'The Ronald O. Perelman and Claudia Cohen Center, New York, NY'}, {'name': 'Zev Rosenwaks', 'org': 'Weill Cornell Medical College, New York, NY'}] 


Top 3 results: 
1 . Paper index =  10055
Similarity score:  1.0
Title : [Diagnosis of cerebral tumors; comparative studies on arteriography, ventriculography and electroencephalography].
FOS : ['Radiology', 'Pathology', 'Surgery']
Year : 1953
Abstract : None
Authors : [{'name': 'Antoine'}, {'name': 'Lepoire'}, {'name': 'Schoumacker'}] 

2 . Paper index =  11771
Similarity score:  1.0
Title : A Study of Special Functions in the Theory of Eclipsing Binary Systems
FOS : ['Contact binary']
Year : 1981
Abstract : None
Authors : [{'name': 'Filaretti Zafiropoulos', 'org': 'University of Manchester'}] 

3 . Paper index =  11773
Similarity score:  1.0
Title : Studies of powder flow using a recording powder flowmeter and measurement of the dynamic angle of repose
FOS : None
Year : 1985
Abstract : This paper describes the utility of the dynamic measurement of the angle of repose for pharmaceutical systems, using a variable rotating cylinder to quantify powder flow. The dynamic angle of repose of sodium chloride powder sieve fractions was evaluated using a variable rotating cylinder. The relationship between the static and the dynamic angle of repose is discussed. The dynamic angle of repose of six lots of a multivitamin preparation were compared for inter- and intralot variation. In both cases, no significant differences (p> 0.05) were observed. In the multivitamin formulation, lubricants at lower concentration levels did not show a significant effect (p> 0.05) on the dynamic angle of repose when compared with flow rates. The effect of different hopper sizes and geometry has been evaluated using the recording powder flowmeter. The results indicate that although different hoppers affect the quantitative nature of the results, the same general trends are apparent. Thus, it appears possible to use a recording powder flowmeter with small quantities of material to predict the effect of formulation and processing variables on the flow of production scale quantities. This paper does not describe a comprehensive evaluation of the pharmaceutical utility of measuring the dynamic angle of repose. However, the results discussed are not encouraging and suggest that the recording powder flowmeter is more sensitive to the effects of formulation and production variables on powder flow.
Authors : [{'name': 'Ramachandra P. Hegde', 'org': 'Department of Pharmacy, University of Rhode Island, Kingston, RI 02881'}, {'name': 'J.L. Rheingold', 'org': 'Formulation Research, Lederle Laboratories, Pearl River, NY 10965'}, {'name': 'S. Welch', 'org': 'Formulation Research, Lederle Laboratories, Pearl River, NY 10965'}, {'name': 'C. T. Rhodes', 'org': 'Department of Pharmacy, University of Rhode Island, Kingston, RI 02881|||Department of Pharmacy, University of Rhode Island, Kingston, RI 02881'}] 

Example 9-11. Change in index assignment caused by the transformations

In [0]:
model_df.loc[21]
Out[0]:
abstract    A microprocessor includes hardware registers t...
authors                      [{'name': 'Mark John Ebersole'}]
fos         [Embedded system, Parallel computing, Computer...
keywords                                                 None
title       Microprocessor that enables ARM ISA program to...
year                                                     2013
Name: 21, dtype: object
In [0]:
model_df.iloc[21]
Out[0]:
abstract                                                 None
authors     [{'name': 'Nicola M. Heller'}, {'name': 'Steph...
fos         [Biology, Medicine, Post-transcriptional regul...
keywords    [glucocorticoids, post transcriptional regulat...
title       Post-transcriptional regulation of eotaxin by ...
year                                                     2002
Name: 30, dtype: object
In [0]:
model_df.index.get_loc(30)
Out[0]:
21

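Third Pass: More Features = More Information

  • The results so far do not support the hypothesis that publication year and field of study are enough to recommend similar papers
    • Use more of the original dataset to see whether the results improve (they are unlikely to change much)
    • Spend more time exploring whether the data is sufficient to give good recommendations (it already is)
    • Add features (focusing on abstract and authors)

Academic Paper Recommender: Take 3

  • Uses tf-idf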
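
As a reminder of what tf-idf computes, here is a minimal sketch using the textbook form: term frequency times log(N / document frequency). The three documents are invented. scikit-learn's `TfidfVectorizer` differs in details such as idf smoothing and L2 normalization, so its numbers will not match this sketch.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document (textbook tf-idf)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = ["powder flow measurement",
        "powder flow rates",
        "gene sequence analysis"]
weights = tf_idf(docs)
print(weights[0])
```

Terms that appear in many documents, such as "powder" here, get a lower weight than terms unique to one document, such as "measurement".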

Example 9-12. Stopwords + tf-idf

In [0]:
# Fill NaN entries so that sklearn can process the data.
filled_df = model_df.fillna('None')
filled_df.head()
Out[0]:
abstract authors fos keywords title year
0 A system and method for maskless direct write ... None [Electronic engineering, Computer hardware, En... None System and Method for Maskless Direct Write Li... 2015
1 None [{'name': 'Ahmed M. Alluwaimi'}] [Biology, Virology, Immunology, Microbiology] [paratuberculosis, of, subspecies, proceedings... The dilemma of the Mycobacterium avium subspec... 2016
2 None [{'name': 'Jovana P. Lekovich', 'org': 'Weill ... None None Should endometriosis be an indication for intr... 2015
5 None [{'name': 'George C. Sponsler'}] None None Should APS Discuss Public Issues: Direct, pers... 1968
7 Full textFull text is available as a scanned c... [{'name': 'M. T. Richards'}] [Medicine, Pathology, Gynecology, Surgery] [breast neoplasms, female, middle aged, adoles... Breast surgery--as an office procedure 1973
In [0]:
# abstract: stopword removal and frequency-based filtering
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X_abstract = vectorizer.fit_transform(filled_df['abstract'])
X_abstract
Out[0]:
<10399x48516 sparse matrix of type '<class 'numpy.float64'>'
	with 374055 stored elements in Compressed Sparse Row format>
In [0]:
print("n_samples: %d, n_features: %d" % X_abstract.shape)
X_yrs.shape[1] + X_fos.shape[1] + X_abstract.shape[1]
n_samples: 10399, n_features: 48516
Out[0]:
56139
In [0]:
# 10399 x 56139 array

%time third_features = np.append(second_features, X_abstract.toarray(), axis = 1)
Wall time: 9.69 s
In [0]:
paper_recommender(third_features, 2, 3)
Based on the paper:
Paper index =  2
Title : Should endometriosis be an indication for intracytoplasmic sperm injection (ICSI) in fresh IVF cycles
FOS : None
Year : 2015
Abstract : None
Authors : [{'name': 'Jovana P. Lekovich', 'org': 'Weill Cornell Medical College, New York, NY'}, {'name': 'G.D. Palermo', 'org': 'Weill Medical College of Cornell University, New York, NY'}, {'name': 'Nigel Pereira', 'org': 'The Ronald O. Perelman and Claudia Cohen Center, New York, NY'}, {'name': 'Zev Rosenwaks', 'org': 'Weill Cornell Medical College, New York, NY'}] 


Top 3 results: 
1 . Paper index =  10055
Similarity score:  1.0
Title : [Diagnosis of cerebral tumors; comparative studies on arteriography, ventriculography and electroencephalography].
FOS : ['Radiology', 'Pathology', 'Surgery']
Year : 1953
Abstract : None
Authors : [{'name': 'Antoine'}, {'name': 'Lepoire'}, {'name': 'Schoumacker'}] 

2 . Paper index =  11773
Similarity score:  1.0
Title : Studies of powder flow using a recording powder flowmeter and measurement of the dynamic angle of repose
FOS : None
Year : 1985
Abstract : This paper describes the utility of the dynamic measurement of the angle of repose for pharmaceutical systems, using a variable rotating cylinder to quantify powder flow. The dynamic angle of repose of sodium chloride powder sieve fractions was evaluated using a variable rotating cylinder. The relationship between the static and the dynamic angle of repose is discussed. The dynamic angle of repose of six lots of a multivitamin preparation were compared for inter- and intralot variation. In both cases, no significant differences (p> 0.05) were observed. In the multivitamin formulation, lubricants at lower concentration levels did not show a significant effect (p> 0.05) on the dynamic angle of repose when compared with flow rates. The effect of different hopper sizes and geometry has been evaluated using the recording powder flowmeter. The results indicate that although different hoppers affect the quantitative nature of the results, the same general trends are apparent. Thus, it appears possible to use a recording powder flowmeter with small quantities of material to predict the effect of formulation and processing variables on the flow of production scale quantities. This paper does not describe a comprehensive evaluation of the pharmaceutical utility of measuring the dynamic angle of repose. However, the results discussed are not encouraging and suggest that the recording powder flowmeter is more sensitive to the effects of formulation and production variables on powder flow.
Authors : [{'name': 'Ramachandra P. Hegde', 'org': 'Department of Pharmacy, University of Rhode Island, Kingston, RI 02881'}, {'name': 'J.L. Rheingold', 'org': 'Formulation Research, Lederle Laboratories, Pearl River, NY 10965'}, {'name': 'S. Welch', 'org': 'Formulation Research, Lederle Laboratories, Pearl River, NY 10965'}, {'name': 'C. T. Rhodes', 'org': 'Department of Pharmacy, University of Rhode Island, Kingston, RI 02881|||Department of Pharmacy, University of Rhode Island, Kingston, RI 02881'}] 

3 . Paper index =  11778
Similarity score:  1.0
Title : Direct antagonistic action of prostaglandins and phenylisopropyladenosine on the activity of pancreatic triglyceride lipase and the antagonistic effect of polyphloretin phosphate
FOS : ['Endocrinology', 'Biochemistry', 'Diabetes mellitus']
Year : 1974
Abstract : None
Authors : [{'name': 'Mentz P'}, {'name': 'Förster W'}, {'name': 'Giessler C'}] 

Example 9-13. One-hot encoding with scikit-learn's DictVectorizer

In [0]:
authors_df = pd.DataFrame(filled_df.authors)
authors_df.head()
Out[0]:
authors
0 None
1 [{'name': 'Ahmed M. Alluwaimi'}]
2 [{'name': 'Jovana P. Lekovich', 'org': 'Weill ...
5 [{'name': 'George C. Sponsler'}]
7 [{'name': 'M. T. Richards'}]
In [0]:
authors_list = []

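for row in authors_df.itertuples():
    # Build a dictionary from each Series index
    if type(row.authors) is str:
        y = {'None': row.Index}
    if type(row.authors) is list:
        # Add these keys and values to the dictionary.
        # Only the first author's name (and org, if present) are used as keys.
        y = dict.fromkeys(row.authors[0].values(), row.Index)
    authors_list.append(y)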
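
Two properties of the loop above are worth noting. It reads only `row.authors[0]`, so co-authors are ignored. It also stores the row index as the dictionary value, and `DictVectorizer` uses numeric values as the feature values, so the resulting columns hold row indices rather than 0/1 indicators. The sketch below is one possible alternative, not the author's code: the helper `author_features` and the example names are hypothetical. It includes every author and uses 1 as the value.

```python
def author_features(authors):
    """Map every author name and organization on a paper to 1."""
    if not isinstance(authors, list):
        return {"None": 1}  # placeholder used for missing author lists
    features = {}
    for author in authors:
        for value in author.values():
            features[value] = 1
    return features

example_authors = [
    {"name": "Author One", "org": "Org A"},
    {"name": "Author Two"},
]
features = author_features(example_authors)
print(features)
```

A list of such dictionaries can be passed to `DictVectorizer.fit_transform` in the same way as `authors_list`.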
In [0]:
authors_list[0:5]
Out[0]:
[{'None': 0},
 {'Ahmed M. Alluwaimi': 1},
 {'Jovana P. Lekovich': 2, 'Weill Cornell Medical College, New York, NY': 2},
 {'George C. Sponsler': 5},
 {'M. T. Richards': 7}]
In [0]:
v = DictVectorizer(sparse=False)
D = authors_list
X_authors = v.fit_transform(D)
X_authors
Out[0]:
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
In [0]:
print("n_samples: %d, n_features: %d" % X_authors.shape)
X_yrs.shape[1] + X_fos.shape[1] + X_abstract.shape[1] + X_authors.shape[1]
n_samples: 10399, n_features: 14028
Out[0]:
70167
In [0]:
# 10399 x 70167 array

%time fourth_features = np.append(third_features, X_authors, axis = 1)
Wall time: 7.39 s

Example 9-14. Item-based collaborative filtering recommender: Take 3

In [0]:
paper_recommender(fourth_features, 2, 3)
Based on the paper:
Paper index =  2
Title : Should endometriosis be an indication for intracytoplasmic sperm injection (ICSI) in fresh IVF cycles
FOS : None
Year : 2015
Abstract : None
Authors : [{'name': 'Jovana P. Lekovich', 'org': 'Weill Cornell Medical College, New York, NY'}, {'name': 'G.D. Palermo', 'org': 'Weill Medical College of Cornell University, New York, NY'}, {'name': 'Nigel Pereira', 'org': 'The Ronald O. Perelman and Claudia Cohen Center, New York, NY'}, {'name': 'Zev Rosenwaks', 'org': 'Weill Cornell Medical College, New York, NY'}] 


Top 3 results: 
1 . Paper index =  10055
Similarity score:  1.0
Title : [Diagnosis of cerebral tumors; comparative studies on arteriography, ventriculography and electroencephalography].
FOS : ['Radiology', 'Pathology', 'Surgery']
Year : 1953
Abstract : None
Authors : [{'name': 'Antoine'}, {'name': 'Lepoire'}, {'name': 'Schoumacker'}] 

2 . Paper index =  5601
Similarity score:  1.0
Title : 633 Survival after coronary revascularization, with and without mitral valve repair, in patients with ischemic mitral regurgitation. Importance of pre-operative myocardial viability
FOS : ['Cardiology']
Year : 2005
Abstract : None
Authors : [{'name': 'J.B. Le Polain De Waroux'}, {'name': 'Anne-Catherine Pouleur'}, {'name': 'B. Beige'}, {'name': 'Agnes Pasquet'}, {'name': 'B. Gerbe'}, {'name': 'B. Noirhomme'}, {'name': 'G. El Khoury'}, {'name': 'Jean Louis Vanoverschelde'}] 

3 . Paper index =  12256
Similarity score:  1.0
Title : Nucleotide Sequence and Analysis of an Insertion Sequence from Bacillus thuringiensis Related to IS150
FOS : ['Biology', 'Molecular biology', 'Insertion sequence', 'Nucleic acid sequence', 'Bioinformatics', 'Genetics']
Year : 1994
Abstract : A 5.8-kb DNA fragment encoding the  cryIC  gene from  Bacillus thuringiensis  (Bt) subsp.  aizawai  HD229 was subcloned into the pMex7 vector for expression in  Escherichia coli . In addition to the 135-kDa CryIC δ-endotoxin, this DNA fragment also encoded a 30-kDa polypeptide whose open reading frame ( orfX ) was located less than 200 bp upstream of  cryIC  . Nucleotide sequencing showed that  orfX  was truncated at the 5′ end, and full sequence was obtained from a second overlapping clone. Sequence analysis showed that  orfX  could encode a polypeptide closely related to the putative transposase from IS 150 . OrfX was flanked by a 17-bp imperfect inverted repeat, defining the length of the element as 998 bp. Southern blot analysis revealed that the novel insertion sequence was present in a single copy and located in an identical position immediately upstream of  cryIC  in plasmid DNA from both Bt subsp.  aizawai  and  entomocidus.
Authors : [{'name': 'Geoffrey P. Smith'}, {'name': 'David J. Ellar'}, {'name': 'Sharon J. Keeler'}, {'name': 'Cynthia E. Seip'}] 

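  • At least the results now include papers from medical fields (out of 7,604 distinct fields of study in total)
  • Adding more text features, such as noun phrases extracted from titles or stems extracted from keywords, should improve the results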
In [9]:
a = (0, 0, 0, 1)
b = (0, 0, 1, 0)

cosine(a, b) # distance
Out[9]:
1.0
In [10]:
help(cosine)
Help on function cosine in module scipy.spatial.distance:

cosine(u, v, w=None)
    Compute the Cosine distance between 1-D arrays.
    
    The Cosine distance between `u` and `v`, is defined as
    
    .. math::
    
        1 - \frac{u \cdot v}
                  {||u||_2 ||v||_2}.
    
    where :math:`u \cdot v` is the dot product of :math:`u` and
    :math:`v`.
    
    Parameters
    ----------
    u : (N,) array_like
        Input array.
    v : (N,) array_like
        Input array.
    w : (N,) array_like, optional
        The weights for each value in `u` and `v`. Default is None,
        which gives each value a weight of 1.0
    
    Returns
    -------
    cosine : double
        The Cosine distance between vectors `u` and `v`.
    
    Examples
    --------
    >>> from scipy.spatial import distance
    >>> distance.cosine([1, 0, 0], [0, 1, 0])
    1.0
    >>> distance.cosine([100, 0, 0], [0, 1, 0])
    1.0
    >>> distance.cosine([1, 1, 0], [0, 1, 0])
    0.29289321881345254
