Example: using `pandas`

Example: Blog gender dataset

We want to prepare data for building a classifier that predicts whether a blog post was written by a male or a female author.

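import pandas as pd
import nltk
from sklearn.model_selection import train_test_split

# Load the raw CSV; names= supplies the column names explicitly.
blog_data = pd.read_csv('https://attapol.github.io/programming/data/blog-gender-dataset.csv',
                        names=['blog_post', 'gender'])

# Drop rows whose text is missing or too short (str() turns NaN into 'nan', length 3).
blog_data['length'] = blog_data['blog_post'].apply(str).apply(len)
blog_data = blog_data[blog_data['length'] > 80].copy()  # .copy() avoids SettingWithCopyWarning

# Tokenize each post (word_tokenize needs NLTK's Punkt tokenizer data to be installed)
# and normalize the gender labels to 'M' / 'F'.
blog_data['tokenized_text'] = blog_data['blog_post'].apply(lambda x: '|'.join(nltk.word_tokenize(str(x))))
blog_data['label'] = blog_data['gender'].apply(lambda x: x.upper().strip())

# Keep only the columns the model needs, then split 60/20/20 into train/dev/test.
blog_data = blog_data[['label', 'tokenized_text']]
train, the_rest = train_test_split(blog_data, test_size=0.4)
dev, test = train_test_split(the_rest, test_size=0.5)

# index=False keeps the row labels out of the saved files.
train.to_csv('train-blog-gender-dataset.csv', index=False)
dev.to_csv('dev-blog-gender-dataset.csv', index=False)
test.to_csv('test-blog-gender-dataset.csv', index=False)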

1. Load and clean raw data

import pandas as pd
data = pd.read_csv('https://attapol.github.io/programming/data/blog-gender-dataset.csv', 
                  names=['text', 'gender'])
data.head(10)

	text	gender
0	Long time no see. Like always I was rewriting...	M
1	Guest Demo: Eric Iverson’s Itty Bitty Search\...	M
2	Who moved my Cheese??? The world has been de...	M
3	Yesterday I attended a biweekly meeting of an...	M
4	Liam is nothing like Natalie. Natalie never w...	F
5	In the EU we have browser choice, but few know...	M
6	Hmmm.. I really didn't wanna update my blog ti...	F
7	happy teachers day..!! who is celebrating..??\...	M
8	We watch movies. And we see what the camera in...	F
9	Cooking! May be the title of the blog is a gi...	F

Validate the data

Sample a row at random, repeatedly, and check whether anything looks wrong.

data.sample(n=1)['text'].to_list()

["Ahem!This hereby serves as a final reminder that I've moved my blog to my new website, johnsellers.com. Please update your links -- if you even bothered to bookmark this nonsense in the first place, that is.Also, this hereby serves as yet another excuse for me to publish a video of a monkey riding a bicycle. After five years and 454 posts, I am hereby and forevermore moving this blog over to the much-simpler-for-me johnsellers.com.Please update your links.This is your new RSS feed.Enjoy the silence.WHY I'M ANGRY TODAYHoagie crumbs stuck in my sweater! Here's the interview I did with John Cleese.WHY I'M ANGRY TODAYWhy is Chase Bank messing with me? They be frontin'! I am interviewing Lost's Jorge Garcia today, who it turns out has a pretty sweet blog called Dispatches from the Island. Here's my favorite post.In the next few days, I'm going to post a list of the top ten little things that annoyed me this year. And in the next few weeks, I'm going to be moving this blog over to johnsellers.com -- but only after I figure out how to do that. But this is what it's going to look like.Finally, I'm going to stop titling each of my posts after random favorite songs. Instead, I shall name each post after my favorite word in said post.WHY I'M ANGRY TODAYWhy hasn't teleportation been invented yet? That way I could get to the post office without a hassle. And here is my interview with Benicio Del Toro.WHY I'M ANGRY TODAYGoddamn cold. I have three Q+As out right now. In order of how well I "]
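Random sampling returns different rows on every call, which is what we want when browsing. If a particular sample needs to be looked at again, `sample` accepts a `random_state` seed. The sketch below uses a small made-up DataFrame (not the blog dataset) to show that the same seed gives the same rows.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({'text': [f'post {i}' for i in range(10)]})

# Two calls with the same seed select the same rows in the same order.
first = toy.sample(n=3, random_state=0)
second = toy.sample(n=3, random_state=0)

print(first.equals(second))  # True
```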

Look at the average text length to see whether it falls in a reasonable range.

(Note how a new column is added to the DataFrame.)

data.shape

(3232, 2)

data['text length'] = data['text'].apply(lambda x: len(str(x)))

data['text length'].describe()

count     3232.000000
mean      2343.352104
std       4579.572823
min          3.000000
25%        570.000000
50%       1049.000000
75%       1879.250000
max      32714.000000
Name: text length, dtype: float64
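To isolate the column-adding step, here is a minimal sketch on a toy DataFrame (the three rows are made up, not taken from the dataset). It also shows that `str()` turns a missing value into the three-character string `'nan'`, which would explain a minimum length of 3 if some rows have no text.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration; the last row has no text.
toy = pd.DataFrame({'text': ['hello world', 'hi', np.nan]})

# Assigning to a new key creates the column; apply() runs the function on each value.
toy['text length'] = toy['text'].apply(lambda x: len(str(x)))

print(toy['text length'].tolist())  # [11, 2, 3]
```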

data.head()

	text	gender	text length
0	Long time no see. Like always I was rewriting...	M	954
1	Guest Demo: Eric Iverson’s Itty Bitty Search\...	M	1877
2	Who moved my Cheese??? The world has been de...	M	5983
3	Yesterday I attended a biweekly meeting of an...	M	1132
4	Liam is nothing like Natalie. Natalie never w...	F	1139

data['text length'].sum()

Hmm, why are some of the texts so short? Let's take a closer look.

(Note how only some of the rows are selected.)

is_too_short = data['text length'] < 100
data[is_too_short]

	text	gender	text length
142	NaN	NaN	3
999	NaN	NaN	3
1010	NaN	NaN	3
1024	NaN	NaN	3
1471	NaN	NaN	3
1521	NaN	M	3
1728	i love these vitamins when they were from AARP...	M	85
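The row selection above relies on a boolean mask: the comparison produces one `True`/`False` value per row, and indexing with it keeps the `True` rows. A minimal sketch on made-up data (the threshold of 10 is arbitrary and chosen only for this toy example):

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({'text': ['a long enough post', 'short', 'ok'],
                    'text length': [18, 5, 2]})

# Comparing a column with a number gives a boolean Series, one value per row.
is_too_short = toy['text length'] < 10

# Indexing with the mask keeps the rows where it is True; ~ inverts the mask.
short_rows = toy[is_too_short]
long_rows = toy[~is_too_short]

print(short_rows.index.tolist())  # [1, 2]
print(long_rows.index.tolist())   # [0]
```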

Create a new, cleaner DataFrame.

new_data = data[data['text length'] > 80]
print(new_data.shape)
print(data.shape)

(3226, 3)
(3232, 3)

Check the label distribution

That is, check how many times each label occurs, and also check for any stray labels.

new_data['gender'].value_counts()

M      1546
F      1390
 F      153
 M      126
m         5
f         4
F         1
 M        1
Name: gender, dtype: int64

Convert lowercase to uppercase and strip all leading and trailing whitespace.

Then store the result in a separate new column.

new_data['gender'].apply(lambda x: x.upper().strip()).value_counts()

M    1678
F    1548
Name: gender, dtype: int64
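An alternative to `apply` with a lambda is the vectorized `.str` accessor. One practical difference, shown in the sketch below on made-up labels, is that `.str` methods pass missing values through, whereas calling `x.upper()` on a missing value would raise an error.

```python
import pandas as pd

# Toy labels, made up for illustration; the last one is missing.
labels = pd.Series(['M', ' F', 'm', 'F ', None])

# Strip surrounding whitespace, then uppercase; missing values stay missing.
cleaned = labels.str.strip().str.upper()

print(cleaned.value_counts().to_dict())
print(cleaned.isna().sum())  # 1
```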

new_data['label'] = new_data['gender'].apply(lambda x: x.upper().strip())

/Users/te/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

2. Tokenize, tag, and parse data

!pip install nltk

Requirement already satisfied: nltk in /Users/te/opt/anaconda3/lib/python3.7/site-packages (3.6.7)
Requirement already satisfied: click in /Users/te/opt/anaconda3/lib/python3.7/site-packages (from nltk) (8.0.3)
Requirement already satisfied: regex>=2021.8.3 in /Users/te/opt/anaconda3/lib/python3.7/site-packages (from nltk) (2021.11.10)
Requirement already satisfied: joblib in /Users/te/opt/anaconda3/lib/python3.7/site-packages (from nltk) (1.1.0)
Requirement already satisfied: tqdm in /Users/te/opt/anaconda3/lib/python3.7/site-packages (from nltk) (4.62.3)
Requirement already satisfied: importlib-metadata in /Users/te/opt/anaconda3/lib/python3.7/site-packages (from click->nltk) (4.8.2)
Requirement already satisfied: typing-extensions>=3.6.4 in /Users/te/opt/anaconda3/lib/python3.7/site-packages (from importlib-metadata->click->nltk) (3.10.0.2)
Requirement already satisfied: zipp>=0.5 in /Users/te/opt/anaconda3/lib/python3.7/site-packages (from importlib-metadata->click->nltk) (3.6.0)

import nltk
new_data['tokenized_text'] = new_data['text'].apply(lambda x: '|'.join(nltk.word_tokenize(x)))

/Users/te/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
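The SettingWithCopyWarning shown for the `label` and `tokenized_text` assignments appears because `new_data` was created by filtering `data`, so pandas cannot tell whether the assignment is meant to affect the original. One common way to avoid it, sketched below on made-up data, is to take an explicit `.copy()` when creating the filtered DataFrame.

```python
import pandas as pd

# Toy data, made up for illustration only.
data = pd.DataFrame({'gender': ['M', ' f', 'F', 'm '],
                     'text length': [120, 95, 3, 400]})

# .copy() makes new_data an independent DataFrame rather than a slice of data,
# so adding a column to it is unambiguous.
new_data = data[data['text length'] > 80].copy()
new_data['label'] = new_data['gender'].apply(lambda x: x.upper().strip())

print(new_data['label'].tolist())  # ['M', 'F', 'M']
print('label' in data.columns)     # False: the original is untouched
```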
  

new_data.head()

	text	gender	text length	label	tokenized_text
0	Long time no see. Like always I was rewriting...	M	954	M	Long|time|no|see|.|Like|always|I|was|rewriting...
1	Guest Demo: Eric Iverson’s Itty Bitty Search\...	M	1877	M	Guest|Demo|:|Eric|Iverson|’|s|Itty|Bitty|Searc...
2	Who moved my Cheese??? The world has been de...	M	5983	M	Who|moved|my|Cheese|?|?|?|The|world|has|been|d...
3	Yesterday I attended a biweekly meeting of an...	M	1132	M	Yesterday|I|attended|a|biweekly|meeting|of|an|...
4	Liam is nothing like Natalie. Natalie never w...	F	1139	F	Liam|is|nothing|like|Natalie|.|Natalie|never|w...
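Storing the tokens as one `|`-joined string keeps each post in a single CSV cell, and the token list can be recovered later with `split('|')`. The sketch below uses a plain whitespace split as a stand-in tokenizer (an assumption for illustration; the tutorial uses `nltk.word_tokenize`) and also shows the one caveat: a token that itself contains `|` will not survive the round trip.

```python
# Stand-in tokenizer for illustration only: split on whitespace.
tokens = 'Long time no see .'.split()

# Join for storage, then split again to recover the token list.
joined = '|'.join(tokens)
recovered = joined.split('|')

print(joined)               # Long|time|no|see|.
print(recovered == tokens)  # True

# Caveat: a token containing the separator is split into two tokens on the way back.
broken = '|'.join(['a|b', 'c']).split('|')
print(broken)               # ['a', 'b', 'c']
```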

# Keep the cleaned 'label' column (not the raw 'gender' column) alongside the tokens.
new_data = new_data[['label', 'tokenized_text']]

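3. Split the data into train, dev, test sets

Shuffle the data and pick the cut points in the dataset. `train_test_split` shuffles the rows by default, so taking 60% for training and halving the remainder gives a 60/20/20 train/dev/test split.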

from sklearn.model_selection import train_test_split

train, the_rest = train_test_split(new_data, train_size=0.6)
dev, test = train_test_split(the_rest, test_size=0.5)

print(len(train))
print(len(dev))
print(len(test))

1935
645
646
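To make the "shuffle, then cut" idea concrete, here is a hand-rolled sketch that uses only pandas. It is not how `train_test_split` is implemented, and its rounding at the cut points may differ by a row from the sizes printed above; the toy DataFrame and the seed are assumptions for illustration.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({'x': range(10)})

# Shuffle all rows; a fixed random_state makes the shuffle repeatable.
shuffled = toy.sample(frac=1, random_state=0)

# Cut points at 60% and 80% of the rows (integer arithmetic avoids float rounding).
n = len(shuffled)
train_end = n * 6 // 10
dev_end = n * 8 // 10

train = shuffled.iloc[:train_end]
dev = shuffled.iloc[train_end:dev_end]
test = shuffled.iloc[dev_end:]

print(len(train), len(dev), len(test))  # 6 2 2
```

Passing `random_state` to `train_test_split` serves the same purpose of making the split repeatable between runs.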

4. Save to files

train.to_csv('train-blog-gender-dataset.csv', index=False)
dev.to_csv('dev-blog-gender-dataset.csv', index=False)
test.to_csv('test-blog-gender-dataset.csv', index=False)

loaded_data = pd.read_csv('train-blog-gender-dataset.csv')

loaded_data.head()
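The effect of `index=False` can be checked without touching the disk. The sketch below writes a made-up two-row DataFrame to an in-memory buffer (a stand-in for the CSV file) and reads it back: the shuffled row labels are gone, only the two data columns remain, and the token lists can be rebuilt from the `|`-joined strings.

```python
import io
import pandas as pd

# Toy data with a non-default index, as is left behind after shuffling.
toy = pd.DataFrame({'label': ['M', 'F'],
                    'tokenized_text': ['Hello|world|!', 'Hi|there']},
                   index=[7, 3])

buffer = io.StringIO()            # in-memory stand-in for a file on disk
toy.to_csv(buffer, index=False)   # index=False leaves the row labels out
buffer.seek(0)

loaded = pd.read_csv(buffer)
print(loaded.columns.tolist())    # ['label', 'tokenized_text']
print(loaded.index.tolist())      # [0, 1]

# Rebuild the token lists from the stored strings.
token_lists = [text.split('|') for text in loaded['tokenized_text']]
print(token_lists[0])             # ['Hello', 'world', '!']
```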