{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Example: using `pandas`\n",
"\n",
"Example: Blog gender dataset\n",
"\n",
"We want to prepare data for building a classifier that predicts whether a blog post was written by a male or a female author."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
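},
"outputs": [],
"source": [
"import pandas as pd\n",
"import nltk\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Load the raw CSV; column names are supplied because the file has no header row.\n",
"blog_data = pd.read_csv('https://attapol.github.io/programming/data/blog-gender-dataset.csv', \n",
"                        names=['blog_post', 'gender'])\n",
"# Character length of each post; str() turns a missing value into the 3-character string 'nan'.\n",
"blog_data['length'] = blog_data['blog_post'].apply(str).apply(len)\n",
"# Drop empty or very short posts; .copy() avoids SettingWithCopyWarning on the assignments below.\n",
"blog_data = blog_data[blog_data['length'] > 80].copy()\n",
"# Tokenize each post and join the tokens with '|' so a post stays a single CSV field.\n",
"blog_data['tokenized_text'] = blog_data['blog_post'].apply(lambda x: '|'.join(nltk.word_tokenize(str(x))))\n",
"# Normalize labels such as ' m' or 'F ' to 'M' or 'F'.\n",
"blog_data['label'] = blog_data['gender'].apply(lambda x: x.upper().strip())\n",
"\n",
"blog_data = blog_data[['label', 'tokenized_text']]\n",
"# 60% train, then split the remaining 40% evenly into dev and test.\n",
"train, the_rest = train_test_split(blog_data, test_size=0.4)\n",
"dev, test = train_test_split(the_rest, test_size=0.5)\n",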
"\n",
"train.to_csv('train-blog-gender-dataset.csv', index=False)\n",
"dev.to_csv('dev-blog-gender-dataset.csv', index=False)\n",
"test.to_csv('test-blog-gender-dataset.csv', index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## 1. Load and clean raw data"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" text gender\n",
"0 Long time no see. Like always I was rewriting... M\n",
"1 Guest Demo: Eric Iverson’s Itty Bitty Search\\... M\n",
"2 Who moved my Cheese??? The world has been de... M\n",
"3 Yesterday I attended a biweekly meeting of an... M\n",
"4 Liam is nothing like Natalie. Natalie never w... F\n",
"5 In the EU we have browser choice, but few know... M\n",
"6 Hmmm.. I really didn't wanna update my blog ti... F\n",
"7 happy teachers day..!! who is celebrating..??\\... M\n",
"8 We watch movies. And we see what the camera in... F\n",
"9 Cooking! May be the title of the blog is a gi... F"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"data = pd.read_csv('https://attapol.github.io/programming/data/blog-gender-dataset.csv', \n",
" names=['text', 'gender'])\n",
"data.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Check that the data is valid\n",
"\n",
"Sample rows at random a few times and look for anything unusual."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[\"Ahem!This hereby serves as a final reminder that I've moved my blog to my new website, johnsellers.com. Please update your links -- if you even bothered to bookmark this nonsense in the first place, that is.Also, this hereby serves as yet another excuse for me to publish a video of a monkey riding a bicycle. After five years and 454 posts, I am hereby and forevermore moving this blog over to the much-simpler-for-me johnsellers.com.Please update your links.This is your new RSS feed.Enjoy the silence.WHY I'M ANGRY TODAYHoagie crumbs stuck in my sweater! Here's the interview I did with John Cleese.WHY I'M ANGRY TODAYWhy is Chase Bank messing with me? They be frontin'! I am interviewing Lost's Jorge Garcia today, who it turns out has a pretty sweet blog called Dispatches from the Island. Here's my favorite post.In the next few days, I'm going to post a list of the top ten little things that annoyed me this year. And in the next few weeks, I'm going to be moving this blog over to johnsellers.com -- but only after I figure out how to do that. But this is what it's going to look like.Finally, I'm going to stop titling each of my posts after random favorite songs. Instead, I shall name each post after my favorite word in said post.WHY I'M ANGRY TODAYWhy hasn't teleportation been invented yet? That way I could get to the post office without a hassle. And here is my interview with Benicio Del Toro.WHY I'M ANGRY TODAYGoddamn cold. I have three Q+As out right now. In order of how well I \"]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.sample(n=1)['text'].to_list()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Look at the average text length to see whether it falls in a reasonable range.\n",
"\n",
"(Note how a new column is added to the `DataFrame`.)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(3232, 2)"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.shape"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"data['text length'] = data['text'].apply(lambda x: len(str(x)))"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 3232.000000\n",
"mean 2343.352104\n",
"std 4579.572823\n",
"min 3.000000\n",
"25% 570.000000\n",
"50% 1049.000000\n",
"75% 1879.250000\n",
"max 32714.000000\n",
"Name: text length, dtype: float64"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['text length'].describe()"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" text gender text length\n",
"0 Long time no see. Like always I was rewriting... M 954\n",
"1 Guest Demo: Eric Iverson’s Itty Bitty Search\\... M 1877\n",
"2 Who moved my Cheese??? The world has been de... M 5983\n",
"3 Yesterday I attended a biweekly meeting of an... M 1132\n",
"4 Liam is nothing like Natalie. Natalie never w... F 1139"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.head()"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"7573714"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['text length'].sum()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Hmm, why are some texts so short? Let's take a closer look.\n",
"\n",
"(Note how only some of the rows are selected.)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
" text gender text length\n",
"142 NaN NaN 3\n",
"999 NaN NaN 3\n",
"1010 NaN NaN 3\n",
"1024 NaN NaN 3\n",
"1471 NaN NaN 3\n",
"1521 NaN M 3\n",
"1728 i love these vitamins when they were from AARP... M 85"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"is_too_short = data['text length'] < 100\n",
"data[is_too_short]"
]
},
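{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible explanation for the rows of length 3, sketched without `pandas`: an empty CSV field is typically loaded as a float `NaN`, and `str()` of that value is the three-character string `'nan'`. The helper names `text_length` and `safe_length` below are hypothetical and exist only for this illustration; `safe_length` is one option for counting a missing text as length 0, not the approach this notebook uses."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"def text_length(value):\n",
"    # Same computation as the 'text length' column: len(str(x)).\n",
"    return len(str(value))\n",
"\n",
"def safe_length(value):\n",
"    # Hypothetical alternative: treat a missing value (float NaN) as length 0.\n",
"    if isinstance(value, float) and math.isnan(value):\n",
"        return 0\n",
"    return len(str(value))\n",
"\n",
"missing = float('nan')  # stand-in for an empty CSV field\n",
"print(text_length(missing))   # length of the string 'nan'\n",
"print(safe_length(missing))\n",
"print(safe_length('i love these vitamins'))"
]
},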
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Create a new, cleaner `DataFrame`."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(3226, 3)\n",
"(3232, 3)\n"
]
}
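],
"source": [
"# Keep only posts longer than 80 characters; .copy() makes new_data independent of data.\n",
"new_data = data[data['text length'] > 80].copy()\n",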
"print(new_data.shape)\n",
"print(data.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Check the label distribution\n",
"That is, check how many times each label occurs, and whether any stray labels have crept in."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"M 1546\n",
"F 1390\n",
" F 153\n",
" M 126\n",
"m 5\n",
"f 4\n",
"F 1\n",
" M 1\n",
"Name: gender, dtype: int64"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_data['gender'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Convert lowercase letters to uppercase and strip all leading and trailing spaces.\n",
"\n",
"Then store the result as a separate new column.\n"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"M 1678\n",
"F 1548\n",
"Name: gender, dtype: int64"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_data['gender'].apply(lambda x: x.upper().strip()).value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"# assign() returns a new DataFrame, which avoids SettingWithCopyWarning on a filtered slice.\n",
"new_data = new_data.assign(label=new_data['gender'].apply(lambda x: x.upper().strip()))"
]
},
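{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough, `pandas`-free sketch of what the cleaning step does to each value: the list `raw_labels` is made-up sample data (not taken from the dataset), and `normalize_label` is a hypothetical helper name. Counting the cleaned values plays the role of `value_counts()` here."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"def normalize_label(label):\n",
"    # Uppercase and strip surrounding whitespace, e.g. ' m' becomes 'M'.\n",
"    return label.upper().strip()\n",
"\n",
"raw_labels = ['M', ' F', 'f', 'M ', ' m', 'F']  # made-up sample labels\n",
"label_counts = Counter(normalize_label(label) for label in raw_labels)\n",
"print(label_counts)"
]
},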
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## 2. Tokenize, tag, and parse data\n"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: nltk in /Users/te/opt/anaconda3/lib/python3.7/site-packages (3.6.7)\n",
"Requirement already satisfied: click in /Users/te/opt/anaconda3/lib/python3.7/site-packages (from nltk) (8.0.3)\n",
"Requirement already satisfied: regex>=2021.8.3 in /Users/te/opt/anaconda3/lib/python3.7/site-packages (from nltk) (2021.11.10)\n",
"Requirement already satisfied: joblib in /Users/te/opt/anaconda3/lib/python3.7/site-packages (from nltk) (1.1.0)\n",
"Requirement already satisfied: tqdm in /Users/te/opt/anaconda3/lib/python3.7/site-packages (from nltk) (4.62.3)\n",
"Requirement already satisfied: importlib-metadata in /Users/te/opt/anaconda3/lib/python3.7/site-packages (from click->nltk) (4.8.2)\n",
"Requirement already satisfied: typing-extensions>=3.6.4 in /Users/te/opt/anaconda3/lib/python3.7/site-packages (from importlib-metadata->click->nltk) (3.10.0.2)\n",
"Requirement already satisfied: zipp>=0.5 in /Users/te/opt/anaconda3/lib/python3.7/site-packages (from importlib-metadata->click->nltk) (3.6.0)\n"
]
}
],
"source": [
"!pip install nltk"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"import nltk\n",
"# Tokenize each post and join the tokens with '|' so a post stays a single string.\n",
"# assign() returns a new DataFrame, which avoids SettingWithCopyWarning on a filtered slice.\n",
"new_data = new_data.assign(tokenized_text=new_data['text'].apply(lambda x: '|'.join(nltk.word_tokenize(x))))"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" text gender text length \\\n",
"0 Long time no see. Like always I was rewriting... M 954 \n",
"1 Guest Demo: Eric Iverson’s Itty Bitty Search\\... M 1877 \n",
"2 Who moved my Cheese??? The world has been de... M 5983 \n",
"3 Yesterday I attended a biweekly meeting of an... M 1132 \n",
"4 Liam is nothing like Natalie. Natalie never w... F 1139 \n",
"\n",
" label tokenized_text \n",
"0 M Long|time|no|see|.|Like|always|I|was|rewriting... \n",
"1 M Guest|Demo|:|Eric|Iverson|’|s|Itty|Bitty|Searc... \n",
"2 M Who|moved|my|Cheese|?|?|?|The|world|has|been|d... \n",
"3 M Yesterday|I|attended|a|biweekly|meeting|of|an|... \n",
"4 F Liam|is|nothing|like|Natalie|.|Natalie|never|w... "
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_data.head()"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"# Keep the cleaned label column (not the raw gender column) alongside the tokenized text.\n",
"new_data = new_data[['label', 'tokenized_text']]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## 3. Split the data into train, dev, and test sets\n",
"\n",
"Shuffle the data.  \n",
"Find the cut points in the data set."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"train, the_rest = train_test_split(new_data, train_size=0.6)\n",
"dev, test = train_test_split(the_rest, test_size=0.5)"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1935\n",
"645\n",
"646\n"
]
}
],
"source": [
"print(len(train))\n",
"print(len(dev))\n",
"print(len(test))"
]
},
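{
"cell_type": "markdown",
"metadata": {},
"source": [
"The shuffle-and-cut idea can also be sketched by hand with the standard library. This is only an illustration under simple assumptions (a toy list of 10 items, a fixed seed, a 60/20/20 split); it is not how `train_test_split` is implemented, and the names `items`, `train_end`, `dev_end`, and the `*_part` lists are hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"\n",
"items = list(range(10))   # toy stand-in for the rows of the data set\n",
"rng = random.Random(0)    # fixed seed so the shuffle is repeatable\n",
"shuffled = items[:]\n",
"rng.shuffle(shuffled)\n",
"\n",
"# Cut points: the first 60% is train, the remainder is halved into dev and test.\n",
"train_end = len(shuffled) * 6 // 10\n",
"dev_end = train_end + (len(shuffled) - train_end) // 2\n",
"\n",
"train_part = shuffled[:train_end]\n",
"dev_part = shuffled[train_end:dev_end]\n",
"test_part = shuffled[dev_end:]\n",
"print(len(train_part), len(dev_part), len(test_part))"
]
},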
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## 4. Save to files"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train.to_csv('train-blog-gender-dataset.csv', index=False)\n",
"dev.to_csv('dev-blog-gender-dataset.csv', index=False)\n",
"test.to_csv('test-blog-gender-dataset.csv', index=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"loaded_data = pd.read_csv('train-blog-gender-dataset.csv')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"loaded_data.head()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
},
"vscode": {
"interpreter": {
"hash": "34368ba4908ea1be08ba769dfb7764ab7f8ead2384ebb5604cb86637573696f7"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}