Data Analysis and Mining

I. Python Basic Syntax

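Variables and data types:
- Python variables need no declaration; assigning a value creates the variable.
- Common data types are numeric (integer int, floating-point float, complex complex), string (str, enclosed in single, double, or triple quotes), and Boolean (bool, with the values True and False).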
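Operators:
- Arithmetic operators (+, -, *, /, %, **, //), where / is true division, // is floor division, % is the remainder, and ** is exponentiation
- Comparison operators (==, !=, >, <, >=, <=)
- Logical operators (and, or, not)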
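Input and output: input() reads a line of user input and always returns it as a string; print() writes output. eval() evaluates a string as a Python expression, so eval(input()) turns the text "3" into the number 3 rather than merely stripping quotes. Because eval() runs whatever code it is given, int() or float() is the safer way to convert untrusted input.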
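Data structures:
- List (list): an ordered, mutable sequence written with square brackets [], e.g. numbers = [1, 2, 3, 4]. Supports indexing, slicing, appending (append()), and deletion (del, remove(), pop()).
- Tuple (tuple): an ordered, immutable sequence written with parentheses (), e.g. fruits = ("apple", "banana", "orange"). Elements are accessed the same way as in a list but cannot be modified.
- Dictionary (dict): a collection of key-value pairs written with curly braces {}, e.g. person = {"name": "Tom", "age": 20}. Values are read and updated by key; keys must be unique and immutable (hashable).
- Set (set): an unordered collection of unique elements, created with curly braces {} or the set() function, e.g. colors = {"red", "green", "blue"}. An empty set must be created with set(), because {} is an empty dict. Supports set operations such as intersection &, union |, and difference -.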
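The operations listed above can be tried side by side. The following is a minimal sketch; the sample values (warm, primary, the "city" key) are made up for illustration and are not from the notes.

```python
# List: ordered and mutable
numbers = [1, 2, 3, 4]
numbers.append(5)         # [1, 2, 3, 4, 5]
numbers.remove(2)         # removes the first element equal to 2 -> [1, 3, 4, 5]
last = numbers.pop()      # removes and returns the last element, 5
first_two = numbers[:2]   # slicing -> [1, 3]

# Tuple: ordered and immutable
fruits = ("apple", "banana", "orange")
try:
    fruits[0] = "pear"    # item assignment is not allowed on a tuple
    tuple_is_immutable = False
except TypeError:
    tuple_is_immutable = True

# Dict: look up and update values by key
person = {"name": "Tom", "age": 20}
person["age"] = 21
person["city"] = "Paris"  # assigning to a new key adds an entry

# Set: duplicates collapse, and set algebra is built in
warm = {"red", "orange", "yellow"}
primary = {"red", "yellow", "blue"}
both = warm & primary        # intersection
either = warm | primary      # union
only_warm = warm - primary   # difference
empty = set()                # {} would create an empty dict instead
```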
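Program flow control:
- Sequential structure: statements run one after another in the order they are written.
- Selection structure: the if statement runs different code blocks depending on whether a condition holds. The basic form is if condition1: block1 elif condition2: block2 … else: blockN.
- Loop structure: a for loop iterates over an iterable object (a list, tuple, string, dict, or set), e.g. for num in numbers: print(num). A while loop repeats its body as long as a condition is true, e.g. while count < 5: print(count); count += 1.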
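Written out with proper indentation, the three structures look roughly like this. The grade() function and its score thresholds are hypothetical, chosen only to exercise if / elif / else.

```python
def grade(score):
    """Map a numeric score to a letter using if / elif / else."""
    if score >= 90:
        return "A"
    elif score >= 60:
        return "B"
    else:
        return "C"

numbers = [1, 2, 3, 4]

# for loop: visit each element of an iterable
total = 0
for num in numbers:
    total += num

# while loop: repeat as long as the condition holds
count = 0
seen = []
while count < 5:
    seen.append(count)
    count += 1   # without this update the loop would never end

# Iterating a dict yields its keys; items() yields (key, value) pairs
person = {"name": "Tom", "age": 20}
pairs = [(key, value) for key, value in person.items()]
```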

II. Functions

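Function syntax: a function is defined with the def keyword, followed by the function name, an optional parameter list, a colon, and the function body. A return statement in the body hands a result back to the caller; a function without one returns None. For example:
```
def add(a, b):
    return a + b
```
Calling a function: write the function name and pass the arguments, e.g. result = add(3, 5), which calls add with 3 and 5 and assigns the returned value 8 to result.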
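Parameters:
- Positional arguments: passed in the order the parameters were defined; a and b above are positional.
- Default parameters: a parameter can be given a default value in the definition, and callers may then omit it. For example, with def greet(name, msg="Hello"): print(msg, name), the call greet("Tom") uses the default "Hello" for msg.
- Keyword arguments: passed as name=value pairs at the call site, so the order no longer matters, e.g. greet(msg="Hi", name="Tom").
- Variable-length arguments: let a function accept any number of arguments. *args collects extra positional arguments into a tuple; **kwargs collects extra keyword arguments into a dict. For example:
```
def func(*args, **kwargs):
    print(args)    # (1, 2, 3)
    print(kwargs)  # {'name': 'Tom', 'age': 20}

func(1, 2, 3, name="Tom", age=20)
```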
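The four parameter styles can be compared in one place. This sketch returns values instead of printing them, which is a deliberate change from the greet example in the notes so that the results can be inspected; summarize is a hypothetical helper name.

```python
def greet(name, msg="Hello"):
    """Return a greeting; msg falls back to its default when omitted."""
    return f"{msg}, {name}"

def summarize(*args, **kwargs):
    """Collect extra positional args in a tuple and keyword args in a dict."""
    return args, kwargs

a = greet("Tom")                    # positional argument, default msg
b = greet("Tom", "Hi")              # both passed positionally
c = greet(msg="Hi", name="Tom")     # keyword arguments, in any order
args, kwargs = summarize(1, 2, 3, name="Tom", age=20)
```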

Lambda (anonymous) functions: a lambda expression defines a small, single-expression function inline.

square = lambda x: x ** 2
print(square(4))    # prints 16
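Lambdas are most often passed directly to another function rather than assigned to a name. A short sketch, using made-up sample data (the people list is an assumption, not from the notes):

```python
people = [("Tom", 20), ("Jerry", 18), ("Spike", 25)]

# key= takes a function; a lambda avoids defining a named helper
by_age = sorted(people, key=lambda person: person[1])

# map() and filter() accept lambdas as well
squares = list(map(lambda x: x ** 2, [1, 2, 3]))
evens = list(filter(lambda x: x % 2 == 0, range(6)))
```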

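Writing your own functions: decide what the function should do, fix its input parameters and return value, then write the body that implements the logic. For example, a function that computes a factorial:
```
def factorial(n):
    # Assumes n is a non-negative integer
    if n == 0:
        return 1
    else:
        return n * factorial(n - 1)
```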

III. Regular Expressions: Extraction, Matching, and Replacement

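Metacharacters and special characters:
- \d matches a digit, \w matches a letter, digit, or underscore, and . matches any character except a newline
- ^ anchors the match at the start, $ anchors it at the end, and | means "or"
- * matches the preceding element zero or more times, + matches it one or more times, ? matches it zero or one time, and [] matches any single character listed inside the brackets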
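Each metacharacter above can be checked with re.findall(). The test strings below are invented for illustration.

```python
import re

text = "cat cot cut c9t"

digits = re.findall(r"\d", "a1b22")                # each single digit
words = re.findall(r"\w+", "hi_there, you!")       # \w also matches the underscore
any_char = re.findall(r"c.t", text)                # . matches any one character
char_set = re.findall(r"c[aou]t", text)            # [] matches one listed character
optional = re.findall(r"colou?r", "color colour")  # ? makes the u optional
either = re.findall(r"cat|cut", text)              # | means "or"
anchored = re.findall(r"^cat", text)               # ^ anchors at the start
at_end = re.findall(r"t$", text)                   # $ anchors at the end
```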

Extraction: the findall() function in the re module returns every substring that matches the regular expression.

import re
pattern = r"\d+"
string = "I have 2 apples and 3 bananas."
result = re.findall(pattern, string)
print(result)  # prints ['2', '3'], every run of digits in the string

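Matching: match() tries to match only at the beginning of the string and returns a match object on success or None otherwise. search() scans the whole string and returns a match object for the first location that matches. For example:

pattern = r"apple"
string1 = "I like apple."
string2 = "apple is delicious."
print(re.match(pattern, string1))   # None, because the string does not start with "apple"
print(re.search(pattern, string1))  # a match object
print(re.match(pattern, string2))   # a match object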

Replacement: sub() replaces every part of the string that matches the regular expression. For example:

pattern = r"apple"
string = "I like apple and apple pie."
new_string = re.sub(pattern, "orange", string)
print(new_string)  # prints "I like orange and orange pie."
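sub() has a few options beyond plain text replacement. The sketch below goes slightly past what the notes cover: it uses the count argument, backreferences to capture groups, and a function as the replacement, all on invented input strings.

```python
import re

# count limits how many matches are replaced
once = re.sub(r"apple", "orange", "apple and apple pie", count=1)

# Parentheses capture groups; \1 and \2 refer back to them in the replacement
swapped = re.sub(r"(\d+)-(\d+)", r"\2-\1", "10-20")

# The replacement can also be a function that receives the match object
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "2 apples and 3 bananas")
```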

IV. Working with Files

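Text files:

Reading: open a file with the open() function; the default mode is read-only ('r'). For example:

f = open("test.txt", "r")
content = f.read()     # read the whole file as one string
f.seek(0)              # rewind, because read() left the position at the end of the file
lines = f.readlines()  # read all lines into a list
f.close()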

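Writing: open the file in write mode ('w'), which overwrites any existing content, or in append mode ('a'), which adds to the end of the file. For example:
```
f = open("test.txt", "w")
f.write("Hello, world!")
f.close()
```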
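A with block closes the file automatically, even if an error occurs, so it is usually preferred over calling close() by hand. The following sketch writes into a temporary directory (an assumption made here so that no real file is touched) and shows write, append, and two ways of reading lines.

```python
import os
import tempfile

# A temporary directory keeps this sketch from touching real files
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.txt")

    with open(path, "w", encoding="utf-8") as f:   # "w" creates or truncates
        f.write("Hello, world!\n")

    with open(path, "a", encoding="utf-8") as f:   # "a" appends at the end
        f.write("Second line\n")

    with open(path, "r", encoding="utf-8") as f:
        lines = f.readlines()                      # newline characters are kept

    with open(path, "r", encoding="utf-8") as f:
        stripped = [line.rstrip("\n") for line in f]  # iterate line by line
```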

CSV files: read and write them with Python's csv module. Use csv.reader() or csv.DictReader() for reading and csv.writer() or csv.DictWriter() for writing. For example:

import csv
# Read (assumes data.csv already exists)
with open("data.csv", "r", newline="") as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)  # each row is a list of strings
# Write
with open("data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Age"])
    writer.writerow(["Tom", 20])
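The dictionary-based variants mentioned above work with column names instead of positions. A minimal round trip, written to a temporary directory (an assumption, to avoid overwriting a real data.csv):

```python
import csv
import os
import tempfile

rows = [{"Name": "Tom", "Age": 20}, {"Name": "Jerry", "Age": 18}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.csv")

    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["Name", "Age"])
        writer.writeheader()     # writes the Name,Age header row
        writer.writerows(rows)

    with open(path, "r", newline="", encoding="utf-8") as f:
        loaded = list(csv.DictReader(f))   # the header row supplies the keys

# The csv module reads every field back as a string
ages = [int(row["Age"]) for row in loaded]
```

Note that Age comes back as the string "20", so numeric columns need an explicit conversion after reading.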

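Excel files: read them with the pandas read_excel() function and write them with to_excel(). For .xlsx files pandas relies on the openpyxl package being installed. For example:

import pandas as pd
# Read
df = pd.read_excel("data.xlsx")
print(df)
# Write
df.to_excel("new_data.xlsx", index=False)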

JSON files: use json.dump() to write a Python object to a file and json.load() to read it back.

import json
data = {"name": "Alice", "age": 25}
with open("data.json", "w") as f:
    json.dump(data, f)            # write
with open("data.json", "r") as f:
    loaded_data = json.load(f)    # read
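The string-based counterparts json.dumps() and json.loads() are handy for inspecting what actually gets written. A small sketch with made-up data:

```python
import json

data = {"name": "Alice", "age": 25, "tags": ["a", "b"], "active": True}

text = json.dumps(data, indent=2, ensure_ascii=False)  # dict -> JSON string
restored = json.loads(text)                            # JSON string -> dict

# JSON has no tuple type, so a tuple comes back as a list
round_trip = json.loads(json.dumps({"point": (1, 2)}))
```

indent=2 pretty-prints the output, and ensure_ascii=False keeps non-ASCII characters such as Chinese text readable instead of escaping them.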

V. NumPy Numerical Computing Basics

Creating NumPy arrays: use the numpy.array() function, for example:
```
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
```

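Basic array attributes: these include the shape (shape), the number of dimensions (ndim), and the data type (dtype). For example:

print(arr.shape)  # prints (5,)
print(arr.ndim)   # prints 1
print(arr.dtype)  # prints int64 on most platforms (the exact type depends on the platform and the data)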

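Array operations (broadcasting): arrays support element-wise operations such as addition, subtraction, multiplication, division, and exponentiation. When two arrays have different but compatible shapes, broadcasting stretches the smaller one across the larger one. For example:

arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
print(arr1 + arr2)  # prints [5 7 9]
print(arr1 * arr2)  # prints [ 4 10 18]
# Broadcasting
a = np.array([[1, 2], [3, 4]])
b = np.array([10, 20])
print(a + b)  # b is added to every row of a -> [[11, 22], [13, 24]]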
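Broadcasting compares shapes from the last dimension backwards; two dimensions are compatible when they are equal or one of them is 1. The sketch below, with variable names chosen here for illustration, shows a row vector, a column vector, and an incompatible shape.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])        # shape (2, 2)
row = np.array([10, 20])              # shape (2,), treated as (1, 2)
col = np.array([[10], [20]])          # shape (2, 1)

plus_row = a + row   # row is repeated down the rows
plus_col = a + col   # col is repeated across the columns

# Shapes (2, 2) and (3,) are incompatible: 2 != 3 and neither is 1
try:
    a + np.array([1, 2, 3])
    shapes_compatible = True
except ValueError:
    shapes_compatible = False
```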

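Array indexing and slicing: similar to Python lists but more powerful, since multidimensional arrays can be indexed and sliced along several axes at once. For example:

arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr[0, 1])     # prints 2, the element in the first row, second column
print(arr[:, 1:3])   # rows [2 3] and [5 6]: the second and third columns of every row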
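What makes NumPy indexing "more powerful" than list indexing is mainly boolean masks, index lists, and the fact that basic slices are views. A brief sketch (the base array is an extra example introduced here):

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

last_col = arr[:, -1]         # a negative index counts from the end
mask = arr > 3                # element-wise comparison gives a boolean array
large = arr[mask]             # boolean indexing returns a flat 1-D array
picked = arr[:, [0, 2]]       # a list of indices selects columns 0 and 2

base = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
view = base[:, 1:3]                 # basic slices are views, not copies
view[0, 0] = 99                     # so this also changes base[0, 1]
```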

VI. pandas Statistical Analysis Basics

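DataFrame and Series:
- A DataFrame is a two-dimensional, table-like data structure; each of its columns is a Series.
- A Series is a one-dimensional array, similar to a NumPy array but with labels (an index). For example:
```
import pandas as pd
data = {'Name': ['Tom', 'Jerry'], 'Age': [20, 18]}
df = pd.DataFrame(data)
print(df)
# Output:
#     Name  Age
# 0    Tom   20
# 1  Jerry   18
```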
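Basic statistical analysis: describe() returns a statistical summary of the numeric columns, including the count, mean, standard deviation, minimum, quartiles, and maximum. For example: print(df.describe())
Sorting data: sort_values() sorts by the values of the given column, and the ascending parameter selects ascending or descending order. For example:
```
# Sort by age in descending order
df_sorted = df.sort_values(by="Age", ascending=False)
print(df_sorted)
```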
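A few related operations are worth seeing together: pulling one column out as a Series, sorting on several columns, and renumbering the index afterwards. The third row ("Spike") is sample data added here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Tom", "Jerry", "Spike"], "Age": [20, 18, 25]})

ages = df["Age"]                 # selecting one column returns a Series
mean_age = ages.mean()           # 21.0
oldest = df.sort_values(by="Age", ascending=False).iloc[0]["Name"]

# Sort by several columns: pass lists to by= and ascending=
ordered = df.sort_values(by=["Age", "Name"], ascending=[True, False])

# sort_values keeps the original index labels; reset_index renumbers them
renumbered = ordered.reset_index(drop=True)
```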

VII. pandas Data Preprocessing

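Data cleaning:

Handling missing values: isnull() checks for missing values, dropna() drops the rows or columns that contain them, and fillna() fills them in. For example:

df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, None, None, 8]})
print(df.isnull())      # True wherever a value is missing
df_drop = df.dropna()   # drop rows that contain any missing value
df_fill = df.fillna(0)  # fill missing values with 0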
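Filling with 0 or dropping every incomplete row is not always what the data calls for. The sketch below shows a few common alternatives on the same frame; which one is appropriate depends on the dataset, so treat it as an illustration rather than a recommendation.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, None, 4], "B": [5, None, None, 8]})

missing_per_column = df.isnull().sum()      # A: 1, B: 2
filled_mean = df.fillna(df.mean())          # fill each column with its own mean
dropped_all_nan = df.dropna(how="all")      # drop only rows where every value is missing
kept_b = df.dropna(subset=["B"])            # drop rows where column B is missing
```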

Removing duplicates: drop_duplicates() deletes duplicate rows, keeping the first occurrence by default. For example:

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [3, 3, 4, 4]})
df_unique = df.drop_duplicates()
print(df_unique)

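Data transformation:
- One-hot encoding: get_dummies() turns a categorical column into one indicator column per category. For example, assuming df has a column named Category:
  
  df_encoded = pd.get_dummies(df, columns=["Category"])
- Type conversion: astype() converts data to the given type. For example: df['A'] = df['A'].astype(int). Missing values must be filled or dropped first, because NaN cannot be converted to int.
- Standardization and normalization: for numeric data, the scale() function in scikit-learn's preprocessing module performs standardization (Z-score), and minmax_scale() performs min-max normalization. For example: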
```
from sklearn import preprocessing
data_standardized = preprocessing.scale(df['A'])
data_normalized = preprocessing.minmax_scale(df['A'])
```
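To see what the two functions compute, the same transformations can be written by hand with pandas. This is a sketch of the underlying formulas, not scikit-learn's implementation; scale() uses the population standard deviation (ddof=0), whereas pandas std() defaults to ddof=1, so the ddof argument matters when comparing results.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Z-score: subtract the mean, then divide by the standard deviation.
# ddof=0 is the population standard deviation; pandas defaults to ddof=1.
z = (s - s.mean()) / s.std(ddof=0)

# Min-max: rescale linearly so the smallest value is 0 and the largest is 1
mm = (s - s.min()) / (s.max() - s.min())
```

After standardization the values have mean 0 and standard deviation 1; after min-max normalization they lie in the range [0, 1].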

Data merging

concat(): joins multiple DataFrame or Series objects along an axis. For example:

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
df_concat = pd.concat([df1, df2], axis=0)  # stack row-wise; pass ignore_index=True to renumber the index

merge(): joins rows on one or more keys, much like a JOIN in SQL. For example:

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value2': [4, 5, 6]})
df_merged = pd.merge(df1, df2, on='key')  # inner join: keeps only keys present in both tables (A and B)
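The how parameter selects the join type, mirroring SQL's INNER, LEFT, and FULL OUTER joins. A short comparison on the same two tables:

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["A", "B", "C"], "value1": [1, 2, 3]})
df2 = pd.DataFrame({"key": ["A", "B", "D"], "value2": [4, 5, 6]})

inner = pd.merge(df1, df2, on="key")                 # keys A, B
left = pd.merge(df1, df2, on="key", how="left")      # keys A, B, C
outer = pd.merge(df1, df2, on="key", how="outer")    # keys A, B, C, D

# Rows without a partner get NaN in the columns from the other table
c_value2_missing = left.loc[left["key"] == "C", "value2"].isna().all()
```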