Quora Classification (1)

Studying past Kaggle competitions is fine, but I thought that joining a competition that is currently running would keep me on a schedule and be more fun, so I decided to take on the recently started Quora Classification competition. The deadline is February 5 of next year, so there is still a fair amount of time left.
My study method for Kaggle: follow along with good kernels + try applying the approaches I already know.
The competition started a few weeks ago and I am only starting now, so I need to pick up the pace. Let's go!

Description

  • Goal
    Classify the questions posted on Quora as sincere or insincere, i.e., tell toxic questions apart from normal ones

  • Criteria for an insincere question
    Has a non-neutral tone
    Has an exaggerated tone to underscore a point about a group of people
    Is rhetorical and meant to imply a statement about a group of people
    Is disparaging or inflammatory
    Suggests a discriminatory idea against a protected class of people, or seeks confirmation of a stereotype
    Makes disparaging attacks/insults against a specific person or group of people
    Based on an outlandish premise about a group of people
    Disparages against a characteristic that is not fixable and not measurable
    Isn’t grounded in reality
    Based on false information, or contains absurd assumptions
    Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine answers

  • Notes
    No external datasets may be used other than the ones provided with the competition

Data Analysis

import pandas as pd
from tqdm import tqdm
tqdm.pandas()

Load the data

train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")
print("Train shape : ",train.shape)
print("Test shape : ",test.shape)
Train shape :  (1306122, 3)
Test shape :  (56370, 2)

build_vocab: builds a dictionary of words and their occurrence counts
tqdm(disable=False): disable defaults to False; set it to True if you do not want to see the progress bar

def build_vocab(sentences, verbose =  True):
    """
    :param sentences: list of list of words
    :return: dictionary of words and their count
    """
    vocab = {}
    for sentence in tqdm(sentences, disable = (not verbose)):
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab

Once tqdm is registered via tqdm.pandas(), you can use progress_apply instead of apply and progress_map instead of map to track progress.

sentences = train["question_text"].progress_apply(lambda x: x.split()).values
vocab = build_vocab(sentences)
print({k: vocab[k] for k in list(vocab)[:5]})
100%|██████████| 1306122/1306122 [00:06<00:00, 202291.02it/s]
100%|██████████| 1306122/1306122 [00:06<00:00, 210658.04it/s]

{'How': 261930, 'did': 33489, 'Quebec': 97, 'nationalists': 91, 'see': 9003}

To obtain a vector for each word, use the pretrained Google News Word2Vec model.

from gensim.models import KeyedVectors

news_path = '../input/embeddings/GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin'
embeddings_index = KeyedVectors.load_word2vec_format(news_path, binary=True)

Google Word2Vec is a model pretrained on ordinary, well-formed sentences, and Word2Vec is quite vulnerable to out-of-vocabulary words, so the check_coverage function below is used to examine which words are out of vocabulary.

print({k: vocab[k] for k in list(vocab)[:5]})
print(embeddings_index['How'])
{'How': 261930, 'did': 33489, 'Quebec': 97, 'nationalists': 91, 'see': 9003}
[ 0.16015625  0.21679688  0.05493164  0.20703125 -0.15332031  0.13378906
  0.10693359  0.01831055  0.140625   -0.10253906  0.1796875  -0.14453125
 -0.34570312 -0.0703125  -0.25976562  0.21679688  0.07714844  0.19726562
  0.05371094 -0.04296875 -0.03222656  0.03173828  0.46875    -0.19628906
 -0.1484375   0.13574219 -0.23046875 -0.06787109  0.12451172  0.08447266
  0.08349609  0.05981445 -0.03515625 -0.17089844  0.02526855  0.375
 -0.00842285  0.08984375 -0.03955078  0.27929688 -0.16601562  0.00601196
  0.13183594 -0.28125    -0.03515625  0.00384521 -0.01965332 -0.07861328
  0.06396484  0.26367188 -0.30273438  0.21582031  0.03662109 -0.18261719
  0.14941406 -0.05249023 -0.20996094  0.04956055  0.30664062  0.13476562
  0.15722656  0.00598145 -0.4453125  -0.03466797 -0.03320312 -0.23339844
  0.00071716  0.22363281 -0.10009766 -0.19921875  0.13085938  0.17480469
  0.18164062  0.01538086 -0.15722656 -0.40234375 -0.04638672  0.22363281
 -0.03173828  0.22265625 -0.15039062  0.0859375  -0.16992188  0.11962891
 -0.48632812 -0.2109375  -0.17773438  0.18359375 -0.03198242  0.22460938
 -0.078125    0.140625   -0.35742188  0.01635742 -0.07714844 -0.08056641
  0.20703125 -0.00521851  0.01055908  0.17578125  0.20214844  0.0071106
 -0.08349609  0.10791016 -0.203125   -0.02026367 -0.32617188 -0.07714844
 -0.10839844  0.1796875  -0.12304688 -0.00897217  0.06835938 -0.08154297
  0.40625     0.02636719 -0.125       0.14160156 -0.13183594  0.04833984
 -0.15429688  0.04833984  0.01184082 -0.06640625  0.13964844  0.18847656
 -0.16699219  0.20507812 -0.0456543  -0.29101562 -0.05126953 -0.19726562
 -0.00772095 -0.07128906  0.01470947 -0.3046875  -0.03442383 -0.22265625
 -0.00946045  0.20410156  0.24414062 -0.02368164 -0.08935547 -0.02539062
  0.25585938 -0.29882812  0.31445312 -0.37695312  0.31835938  0.03466797
  0.00318909 -0.02124023  0.02929688  0.05566406  0.16113281 -0.07617188
 -0.08447266 -0.02331543 -0.08691406 -0.10009766 -0.00216675  0.171875
  0.05493164  0.07519531  0.02990723 -0.01513672  0.15136719 -0.13183594
 -0.15039062  0.25       -0.18554688  0.22558594 -0.03686523 -0.22167969
 -0.19726562  0.10595703  0.19140625  0.01324463 -0.140625   -0.12597656
 -0.18359375  0.01226807  0.0189209  -0.1328125  -0.08251953  0.15234375
 -0.15332031 -0.01391602  0.14550781  0.03955078  0.0390625  -0.17285156
  0.15820312 -0.06030273  0.05029297 -0.27539062 -0.10986328 -0.00564575
  0.265625   -0.21386719  0.0703125   0.20703125 -0.29492188  0.29101562
 -0.078125    0.06005859 -0.19726562 -0.00070953 -0.09619141 -0.06396484
 -0.20117188  0.22070312 -0.06689453 -0.04174805 -0.296875    0.01495361
  0.22753906  0.24316406 -0.02246094 -0.01757812 -0.15625    -0.06542969
 -0.42578125 -0.01055908  0.10888672 -0.13378906  0.13476562 -0.24804688
 -0.01452637 -0.15136719  0.18652344 -0.0480957   0.04199219  0.02075195
 -0.05273438 -0.18457031 -0.02368164  0.09472656 -0.140625    0.02868652
 -0.11083984  0.09814453  0.20898438 -0.27929688 -0.28125    -0.12109375
  0.0201416   0.07177734  0.02233887 -0.04394531  0.27539062  0.1953125
  0.34179688  0.37695312  0.03637695 -0.26757812 -0.00549316 -0.05029297
 -0.05664062  0.0057373  -0.05029297  0.09814453 -0.20410156  0.06982422
 -0.08398438  0.17675781  0.03540039  0.04638672 -0.06298828  0.03076172
 -0.10693359  0.17089844  0.28515625  0.19921875 -0.16992188 -0.13574219
 -0.01879883 -0.125       0.04516602  0.13964844  0.26171875  0.04907227
 -0.07177734  0.453125    0.01843262 -0.19238281 -0.359375   -0.20214844
 -0.0324707   0.39648438 -0.13476562  0.09716797 -0.11865234 -0.02001953
 -0.09472656 -0.41796875  0.06689453  0.04541016 -0.08105469  0.08935547]

24.31% of the words in the train data are found in the Google embedding model.
Across the whole text, 78.75% of word occurrences are covered by the Google embeddings and the remaining 21.25% are out of vocabulary (the check_coverage output below).

import operator 

def check_coverage(vocab, embeddings_index):
    a = {}      # words found in the embeddings
    oov = {}    # out-of-vocabulary words
    k = 0       # number of word occurrences covered by the embeddings
    i = 0       # number of out-of-vocabulary word occurrences
    for word in tqdm(vocab):
        try:
            a[word] = embeddings_index[word]
            k += vocab[word]
        except KeyError:
            oov[word] = vocab[word]
            i += vocab[word]

    print('Found embeddings for {:.2%} of vocab'.format(len(a) / len(vocab)))
    print('Found embeddings for  {:.2%} of all text'.format(k / (k + i)))
    # sort the out-of-vocabulary words by occurrence count, most frequent first
    sorted_x = sorted(oov.items(), key=operator.itemgetter(1))[::-1]

    return sorted_x
oov = check_coverage(vocab,embeddings_index)
100%|██████████| 508823/508823 [00:01<00:00, 273530.30it/s]


Found embeddings for 24.31% of vocab
Found embeddings for  78.75% of all text

The Google embedding model that will be used to turn words into vectors only matches 24% of the train vocabulary, and roughly 21% of the text would end up with default/random embedding values, so this needs improvement.
The out-of-vocabulary words look like this:

oov[:10]
[('to', 406298),
 ('a', 403852),
 ('of', 332964),
 ('and', 254081),
 ('2017', 8781),
 ('2018', 7373),
 ('10', 6642),
 ('12', 3694),
 ('20', 2942),
 ('100', 2883)]

Looking at the out-of-vocabulary words, they include stop words such as 'to', 'a', 'of', as well as numbers and special symbols. Stop words, numbers, and special symbols are usually all removed during NLP preprocessing, but the Google embedding model does contain some special symbols (for example '&'), so they are handled case by case.

'?' in embeddings_index
False
'&' in embeddings_index
True
def clean_text(x):
    x = str(x)
    # replace '/', '-' and apostrophes with a space so joined words are split apart
    for punct in "/-'":
        x = x.replace(punct, ' ')
    # '&' exists in the Google embeddings, so keep it as a separate token
    for punct in '&':
        x = x.replace(punct, f' {punct} ')
    # drop the remaining punctuation, which the embeddings do not cover
    for punct in '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '“”’':
        x = x.replace(punct, '')
    return x

Re-run build_vocab and check_coverage on the preprocessed sentences.
Word coverage improves to 57.38% and coverage of all text improves to 89.99%.

train["question_text"] = train["question_text"].progress_apply(lambda x: clean_text(x))
sentences = train["question_text"].apply(lambda x: x.split())
vocab = build_vocab(sentences)
100%|██████████| 1306122/1306122 [00:11<00:00, 111236.41it/s]
100%|██████████| 1306122/1306122 [00:05<00:00, 234179.16it/s]
oov = check_coverage(vocab,embeddings_index)

100%|██████████| 253623/253623 [00:01<00:00, 232458.35it/s]


Found embeddings for 57.38% of vocab
Found embeddings for  89.99% of all text
oov[:10]
[('to', 406298),
 ('a', 403852),
 ('of', 332964),
 ('and', 254081),
 ('2017', 8781),
 ('2018', 7373),
 ('10', 6642),
 ('12', 3694),
 ('20', 2942),
 ('100', 2883)]

Look at the top 10 words of the Google embedding (using gensim's index2entity attribute).
Curiously, 'to', 'a', 'of', and 'and' are out of vocabulary, while 'in', 'for', 'that', 'is', and 'on' are included in the embedding.
'##' is also included; it appears that two-digit numbers were replaced with '##'.

for i in range(10):
    print(embeddings_index.index2entity[i])
</s>
in
for
that
is
on
##
The
with
said

A preprocessing function that rewrites numbers to match the Google embedding convention (digit runs become '#' placeholders):

import re

def clean_numbers(x):
    # replace digit runs with '#' placeholders, matching the Google embedding convention (e.g. 2018 -> ####)
    x = re.sub('[0-9]{5,}', '#####', x)
    x = re.sub('[0-9]{4}', '####', x)
    x = re.sub('[0-9]{3}', '###', x)
    x = re.sub('[0-9]{2}', '##', x)
    return x
train["question_text"] = train["question_text"].progress_apply(lambda x: clean_numbers(x))
sentences = train["question_text"].progress_apply(lambda x: x.split())
vocab = build_vocab(sentences)
100%|██████████| 1306122/1306122 [00:15<00:00, 86175.69it/s]
100%|██████████| 1306122/1306122 [00:06<00:00, 214020.03it/s]
100%|██████████| 1306122/1306122 [00:05<00:00, 230907.35it/s]
oov = check_coverage(vocab,embeddings_index)
100%|██████████| 242997/242997 [00:01<00:00, 186576.20it/s]


Found embeddings for 60.41% of vocab
Found embeddings for  90.75% of all text

With the special-symbol and number preprocessing included, coverage rises to 60.41% of the vocabulary and 90.75% of all text.
The remaining out-of-vocabulary words are shown below.

oov[:20]
[('to', 406298),
 ('a', 403852),
 ('of', 332964),
 ('and', 254081),
 ('favourite', 1247),
 ('bitcoin', 987),
 ('colour', 976),
 ('doesnt', 918),
 ('centre', 886),
 ('Quorans', 858),
 ('cryptocurrency', 822),
 ('Snapchat', 807),
 ('travelling', 705),
 ('counselling', 634),
 ('btech', 632),
 ('didnt', 600),
 ('Brexit', 493),
 ('cryptocurrencies', 481),
 ('blockchain', 474),
 ('behaviour', 468)]

The out-of-vocabulary words still contain stop words and misspelled words.

Things to add later

  • Preprocess the test data as well as the train data, and add misspelling preprocessing
  • Expand contractions into full words (don't -> do not, haven't -> have not, etc.) — see the sketch right below
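
The contraction expansion from the second item is not implemented in the kernel yet; a minimal sketch, using a small hypothetical contraction_dict of my own, could look like this:

import re

# hypothetical contraction dictionary for illustration; a real one would be much larger
contraction_dict = {"don't": "do not", "haven't": "have not",
                    "can't": "can not", "won't": "will not"}
contraction_re = re.compile('(%s)' % '|'.join(map(re.escape, contraction_dict)))

def expand_contractions(text):
    # replace each matched contraction with its expanded form
    return contraction_re.sub(lambda m: contraction_dict[m.group(0)], text)

print(expand_contractions("I don't know why they haven't answered"))
# -> I do not know why they have not answered

For now, common misspellings and British spellings are replaced with the same dictionary-plus-regex pattern: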
def _get_mispell(mispell_dict):
    # build a regex that matches any key of the misspelling dictionary
    mispell_re = re.compile('(%s)' % '|'.join(mispell_dict.keys()))
    return mispell_dict, mispell_re


mispell_dict = {'colour':'color',
                'centre':'center',
                'didnt':'did not',
                'doesnt':'does not',
                'isnt':'is not',
                'shouldnt':'should not',
                'favourite':'favorite',
                'travelling':'traveling',
                'counselling':'counseling',
                'theatre':'theater',
                'cancelled':'canceled',
                'labour':'labor',
                'organisation':'organization',
                'wwii':'world war 2',
                'citicise':'criticize',
                'instagram': 'social medium',
                'whatsapp': 'social medium',
                'snapchat': 'social medium'}
mispellings, mispellings_re = _get_mispell(mispell_dict)

def replace_typical_misspell(text):
    def replace(match):
        return mispellings[match.group(0)]

    return mispellings_re.sub(replace, text)
train["question_text"] = train["question_text"].progress_apply(lambda x: replace_typical_misspell(x))
sentences = train["question_text"].progress_apply(lambda x: x.split())
to_remove = ['a','to','of','and']
sentences = [[word for word in sentence if not word in to_remove] for sentence in tqdm(sentences)]
vocab = build_vocab(sentences)
100%|██████████| 1306122/1306122 [00:05<00:00, 225847.72it/s]
100%|██████████| 1306122/1306122 [00:04<00:00, 264526.59it/s]
100%|██████████| 1306122/1306122 [00:04<00:00, 261921.71it/s]
100%|██████████| 1306122/1306122 [00:06<00:00, 208906.12it/s]
oov = check_coverage(vocab,embeddings_index)
100%|██████████| 242935/242935 [00:01<00:00, 178090.85it/s]


Found embeddings for 60.43% of vocab
Found embeddings for  98.96% of all text

After also removing the stop words ('a', 'to', 'of', 'and') and replacing the misspelled words, coverage rises to 60.43% of the vocabulary and 98.96% of all text.

oov[:20]
[('bitcoin', 987),
 ('Quorans', 858),
 ('cryptocurrency', 822),
 ('Snapchat', 807),
 ('btech', 632),
 ('Brexit', 493),
 ('cryptocurrencies', 481),
 ('blockchain', 474),
 ('behaviour', 468),
 ('upvotes', 432),
 ('programme', 402),
 ('Redmi', 379),
 ('realise', 371),
 ('defence', 364),
 ('KVPY', 349),
 ('Paytm', 334),
 ('grey', 299),
 ('mtech', 281),
 ('Btech', 262),
 ('bitcoins', 254)]

The remaining out-of-vocabulary words are mostly neologisms, proper nouns, and alternative (mostly British) spellings.
These occur rarely, if at all, in the older news corpus the Google embeddings were trained on, so it is hard for them to be covered.
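
For words that stay out of vocabulary like these, the usual approach when building the embedding matrix for a model is to keep the pretrained vector where one exists and fall back to a random vector otherwise. A minimal sketch (not from the kernel; it reuses the vocab and embeddings_index objects above, and the random-init scale is an arbitrary choice):

import numpy as np

embed_dim = 300  # the GoogleNews vectors are 300-dimensional
# assign an integer index to each word; index 0 is reserved for padding
word_index = {word: i for i, word in enumerate(vocab, start=1)}

# start from small random vectors, then overwrite the rows of words found in the pretrained embeddings
embedding_matrix = np.random.normal(0, 0.1, (len(word_index) + 1, embed_dim))
found = 0
for word, idx in word_index.items():
    if word in embeddings_index:
        embedding_matrix[idx] = embeddings_index[word]
        found += 1
print('%d / %d words initialized from the pretrained vectors' % (found, len(word_index)))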

Next steps

  • Take the test data patterns into account during preprocessing as well
  • Replace contracted phrases with their full-word forms
  • Handle the imbalanced data (see the quick check after this list)
  • Look for out-of-vocabulary and common words that can be covered by the GloVe and wiki embeddings, not only the Google embedding
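
As a first look at the imbalance point above, a quick check of the label distribution could look like this (assuming the target column of train.csv holds the 0/1 sincere/insincere label):

# count and proportion of sincere (0) vs. insincere (1) questions in the train data
print(train['target'].value_counts())
print(train['target'].value_counts(normalize=True))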

Reference

  • https://www.kaggle.com/c/quora-insincere-questions-classification
  • https://www.kaggle.com/christofhenkel/how-to-preprocessing-when-using-embeddings
