ตัวอย่างการใช้ pandas#

Example : Blog gender dataset

เราต้องการเตรียมข้อมูลสำหรับการสร้างเครื่องแยกแยะว่า Blog post นั้นเขียนโดยผู้เขียนที่เป็นผู้ชายหรือผู้หญิง

import pandas as pd
import nltk
from sklearn.model_selection import train_test_split

blog_data = pd.read_csv('https://attapol.github.io/programming/data/blog-gender-dataset.csv', 
                  names=['blog_post', 'gender'])
blog_data['length'] = blog_data['blog_post'].apply(str).apply(len)
blog_data = blog_data[blog_data['length'] > 80]
blog_data['tokenized_text'] = blog_data['blog_post'].apply(lambda x: '|'.join(nltk.word_tokenize(str(x))))
blog_data['label'] = blog_data['gender'].apply(lambda x: x.upper().strip())

blog_data = blog_data[['label', 'tokenized_text']]
train, the_rest = train_test_split(blog_data, test_size=0.4)
dev, test = train_test_split(the_rest, test_size=0.5)

train.to_csv('train-blog-gender-dataset.csv', index=False)
dev.to_csv('dev-blog-gender-dataset.csv', index=False)
test.to_csv('test-blog-gender-dataset.csv', index=False)

1. Load and clean raw data#

import pandas as pd
data = pd.read_csv('https://attapol.github.io/programming/data/blog-gender-dataset.csv', 
                  names=['text', 'gender'])
data.head(10)
text gender
0 Long time no see. Like always I was rewriting... M
1 Guest Demo: Eric Iverson’s Itty Bitty Search\... M
2 Who moved my Cheese??? The world has been de... M
3 Yesterday I attended a biweekly meeting of an... M
4 Liam is nothing like Natalie. Natalie never w... F
5 In the EU we have browser choice, but few know... M
6 Hmmm.. I really didn't wanna update my blog ti... F
7 happy teachers day..!! who is celebrating..??\... M
8 We watch movies. And we see what the camera in... F
9 Cooking! May be the title of the blog is a gi... F

ตรวจสอบความถูกต้องของข้อมูล#

ลองสุ่มขึ้นมาดูเรื่อย ๆ ว่ามีอะไรผิดปกติมั้ย

data.sample(n=1)['text'].to_list()
["Ahem!This hereby serves as a final reminder that I've moved my blog to my new website, johnsellers.com. Please update your links -- if you even bothered to bookmark this nonsense in the first place, that is.Also, this hereby serves as yet another excuse for me to publish a video of a monkey riding a bicycle. After five years and 454 posts, I am hereby and forevermore moving this blog over to the much-simpler-for-me johnsellers.com.Please update your links.This is your new RSS feed.Enjoy the silence.WHY I'M ANGRY TODAYHoagie crumbs stuck in my sweater! Here's the interview I did with John Cleese.WHY I'M ANGRY TODAYWhy is Chase Bank messing with me? They be frontin'! I am interviewing Lost's Jorge Garcia today, who it turns out has a pretty sweet blog called Dispatches from the Island. Here's my favorite post.In the next few days, I'm going to post a list of the top ten little things that annoyed me this year. And in the next few weeks, I'm going to be moving this blog over to johnsellers.com -- but only after I figure out how to do that. But this is what it's going to look like.Finally, I'm going to stop titling each of my posts after random favorite songs. Instead, I shall name each post after my favorite word in said post.WHY I'M ANGRY TODAYWhy hasn't teleportation been invented yet? That way I could get to the post office without a hassle. And here is my interview with Benicio Del Toro.WHY I'M ANGRY TODAYGoddamn cold. I have three Q+As out right now. In order of how well I "]

ดูความยาวเฉลี่ยของข้อความดูว่ามันอยู่ช่วงที่โอเคมั้ย

(ให้สังเกตวิธีการเพิ่ม คอลัมน์ใหม่ให้กับ DataFrame)

data.shape
(3232, 2)
data['text length'] = data['text'].apply(lambda x: len(str(x)))
data['text length'].describe()
count     3232.000000
mean      2343.352104
std       4579.572823
min          3.000000
25%        570.000000
50%       1049.000000
75%       1879.250000
max      32714.000000
Name: text length, dtype: float64
data.head()
text gender text length
0 Long time no see. Like always I was rewriting... M 954
1 Guest Demo: Eric Iverson’s Itty Bitty Search\... M 1877
2 Who moved my Cheese??? The world has been de... M 5983
3 Yesterday I attended a biweekly meeting of an... M 1132
4 Liam is nothing like Natalie. Natalie never w... F 1139
data['text length'].sum()
7573714

เอ ทำไมบางข้อความมันสั้นจัง ลองลงไปตรวจหน่อย

(สังเกตวิธีการเลือกแค่บางแถว)

is_too_short = data['text length'] < 100
data[is_too_short]
text gender text length
142 NaN NaN 3
999 NaN NaN 3
1010 NaN NaN 3
1024 NaN NaN 3
1471 NaN NaN 3
1521 NaN M 3
1728 i love these vitamins when they were from AARP... M 85

สร้าง DataFrame ขึ้นมาใหม่ที่สะอาดกว่าเดิม

new_data = data[data['text length'] > 80]
print(new_data.shape)
print(data.shape)
(3226, 3)
(3232, 3)

เช็คการกระจายตัวของ Label#

แปลว่า เช็คว่า label แต่ละขนิดเกิดขึ้นมากน้อยกี่ครั้ง ดูด้วยว่ามี label แปลกปลอมมามั้ย

new_data['gender'].value_counts()
M      1546
F      1390
 F      153
 M      126
m         5
f         4
F         1
 M        1
Name: gender, dtype: int64

เปลี่ยนตัวเล็กให้เป็นตัวใหญ่ และเอา space หน้าหลังออกให้หมด

เสร็จแล้วสร้างเป็น column ใหม่แยกกันออกมา

new_data['gender'].apply(lambda x: x.upper().strip()).value_counts()
M    1678
F    1548
Name: gender, dtype: int64
new_data['label'] = new_data['gender'].apply(lambda x: x.upper().strip())
/Users/te/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

2. Tokenize, tag, and parse data#

!pip install nltk
Requirement already satisfied: nltk in /Users/te/opt/anaconda3/lib/python3.7/site-packages (3.6.7)
Requirement already satisfied: click in /Users/te/opt/anaconda3/lib/python3.7/site-packages (from nltk) (8.0.3)
Requirement already satisfied: regex>=2021.8.3 in /Users/te/opt/anaconda3/lib/python3.7/site-packages (from nltk) (2021.11.10)
Requirement already satisfied: joblib in /Users/te/opt/anaconda3/lib/python3.7/site-packages (from nltk) (1.1.0)
Requirement already satisfied: tqdm in /Users/te/opt/anaconda3/lib/python3.7/site-packages (from nltk) (4.62.3)
Requirement already satisfied: importlib-metadata in /Users/te/opt/anaconda3/lib/python3.7/site-packages (from click->nltk) (4.8.2)
Requirement already satisfied: typing-extensions>=3.6.4 in /Users/te/opt/anaconda3/lib/python3.7/site-packages (from importlib-metadata->click->nltk) (3.10.0.2)
Requirement already satisfied: zipp>=0.5 in /Users/te/opt/anaconda3/lib/python3.7/site-packages (from importlib-metadata->click->nltk) (3.6.0)
import nltk
new_data['tokenized_text'] = new_data['text'].apply(lambda x: '|'.join(nltk.word_tokenize(x)))
/Users/te/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
new_data.head()
text gender text length label tokenized_text
0 Long time no see. Like always I was rewriting... M 954 M Long|time|no|see|.|Like|always|I|was|rewriting...
1 Guest Demo: Eric Iverson’s Itty Bitty Search\... M 1877 M Guest|Demo|:|Eric|Iverson|’|s|Itty|Bitty|Searc...
2 Who moved my Cheese??? The world has been de... M 5983 M Who|moved|my|Cheese|?|?|?|The|world|has|been|d...
3 Yesterday I attended a biweekly meeting of an... M 1132 M Yesterday|I|attended|a|biweekly|meeting|of|an|...
4 Liam is nothing like Natalie. Natalie never w... F 1139 F Liam|is|nothing|like|Natalie|.|Natalie|never|w...
new_data = new_data[['tokenized_text', 'gender']]

3. Split the data into train, dev, test sets#

Shuffle ข้อมูล หาจุดตัดใน data set

image.png

from sklearn.model_selection import train_test_split

train, the_rest = train_test_split(new_data, train_size=0.6)
dev, test = train_test_split(the_rest, test_size=0.5)
print(len(train))
print(len(dev))
print(len(test))
1935
645
646

4. Save ใส่ไฟล์#

train.to_csv('train-blog-gender-dataset.csv', index=False)
dev.to_csv('dev-blog-gender-dataset.csv', index=False)
test.to_csv('test-blog-gender-dataset.csv', index=False)
loaded_data = pd.read_csv('train-blog-gender-dataset.csv')
loaded_data.head()