A Practical Guide to Python Data Analysis on Debian
1. Environment setup and installation
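Update the system and install Python with pip and venv support (python3 -m venv needs the python3-venv package on Debian):
    sudo apt update && sudo apt upgrade
    sudo apt install python3 python3-pip python3-venv

Install the core libraries from the Debian repositories:
    sudo apt install python3-numpy python3-pandas python3-matplotlib
Add the remaining packages with pip. On Debian 12 and later, pip refuses to install into the system Python and reports an externally-managed-environment error, so run this inside a virtual environment rather than system-wide:
    pip3 install seaborn scikit-learn jupyter

Or work entirely in a virtual environment:
    python3 -m venv venv && source venv/bin/activate
    pip install numpy pandas matplotlib seaborn scikit-learn jupyter

Or install Miniconda:
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh
Follow the prompts, then initialize conda and create an environment:
    conda init
    conda create -n datasci python=3.11 numpy pandas matplotlib seaborn scikit-learn jupyter

2. Interactive development and visualization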
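Before launching Jupyter, it can help to confirm that the packages from the installation step are visible in the active environment. The following is a minimal sketch using only the standard library; the helper name installed_versions is hypothetical, and the package list simply mirrors the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as they appear in the pip install command.
PACKAGES = ["numpy", "pandas", "matplotlib", "seaborn", "scikit-learn", "jupyter"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(PACKAGES).items():
        print(f"{name:15s} {ver or 'NOT INSTALLED'}")
```

Run it with the same interpreter that will run Jupyter (for example, after activating the virtual environment), since each environment has its own set of packages.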
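Start Jupyter:
    jupyter notebook
This opens a browser automatically; create or open an .ipynb file to analyze data interactively.

Get summary statistics for a DataFrame df:
    df.describe()

Plot with Matplotlib and seaborn (sns.load_dataset downloads the sample data, so it needs network access):
    import matplotlib.pyplot as plt
    import seaborn as sns
    tips = sns.load_dataset("tips")
    sns.scatterplot(x="total_bill", y="tip", data=tips)
    plt.title("total bill vs tip")
    plt.show()

In a notebook, use %matplotlib inline, or %matplotlib widget (an interactive backend that requires the ipympl package) for a better plotting experience.

3. An example workflow from data import to modeling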
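The steps in this section can be tried end to end without a data.csv file. The following is a minimal sketch, assuming scikit-learn's bundled iris data stands in for the CSV; the column renaming and the run_workflow function name are illustrative choices, not part of the original workflow.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

FEATURES = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def run_workflow(random_state=42):
    # Assumption: the bundled iris data replaces pd.read_csv('data.csv').
    iris = load_iris(as_frame=True)
    df = iris.frame.copy()
    df.columns = FEATURES + ["target"]
    # Turn the integer target into readable species names.
    df["species"] = df["target"].map(dict(enumerate(iris.target_names)))
    df = df.drop(columns="target")

    # Cleaning: fill numeric gaps with the column mean, then drop duplicates.
    df[FEATURES] = df[FEATURES].fillna(df[FEATURES].mean())
    df = df.drop_duplicates()

    # Split, train, and evaluate with a fixed seed for repeatability.
    X, y = df[FEATURES], df["species"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    clf = RandomForestClassifier(random_state=random_state)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    return accuracy_score(y_test, y_pred), classification_report(y_test, y_pred)


if __name__ == "__main__":
    accuracy, report = run_workflow()
    print(f"accuracy: {accuracy:.3f}")
    print(report)
```

Passing the same random_state to both the split and the classifier keeps repeated runs comparable.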
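Load and clean the data:
    import pandas as pd
    import seaborn as sns
    df = pd.read_csv('data.csv')
    # Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas.
    df['Age'] = df['Age'].fillna(df['Age'].mean())
    df.drop_duplicates(inplace=True)

Inspect the structure and the first rows: df.info(), df.head()

Explore the data: df.describe(), sns.pairplot(df), sns.boxplot(x='species', y='petal_length', data=df)
The Age column above is only an example of a numeric column with missing values; from this point on the column names assume an iris-style data set.

Select features and split into training and test sets:
    X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
    y = df['species']
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Train and evaluate a model:
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    clf = RandomForestClassifier(random_state=42)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(classification_report(y_test, y_pred))

4. Connecting to MySQL and other external data sources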
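The create_engine plus pd.read_sql_query pattern used in this section can be tried without a running MySQL server. The following is a minimal sketch, assuming a temporary SQLite file stands in for the MySQL database; the orders table, its columns, and the demo_read_sql function name are made up for illustration. Only the connection URL would change for MySQL.

```python
import os
import tempfile

import pandas as pd
from sqlalchemy import create_engine


def demo_read_sql():
    """Write a small table, then read it back through a SQL query."""
    with tempfile.TemporaryDirectory() as tmp:
        # Assumption: SQLite replaces 'mysql+pymysql://user:password@localhost/dbname'.
        engine = create_engine(f"sqlite:///{os.path.join(tmp, 'demo.db')}")

        # Hypothetical sample data, written with DataFrame.to_sql.
        sample = pd.DataFrame({"id": [1, 2, 3], "amount": [10.5, 20.0, 7.25]})
        sample.to_sql("orders", engine, index=False, if_exists="replace")

        # Same call shape as the MySQL example: SQL string plus engine.
        df = pd.read_sql_query(
            "SELECT id, amount FROM orders WHERE amount > 8 ORDER BY id", engine
        )
        engine.dispose()  # release the file before the directory is removed
        return df


if __name__ == "__main__":
    print(demo_read_sql())
```

Filtering and column selection in the SQL statement itself keeps the transferred result small, which matters more once the database is remote.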
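Install and secure MySQL:
    sudo apt install mysql-server
    sudo mysql_secure_installation
On Debian releases whose repositories ship MariaDB rather than MySQL, the package may instead be default-mysql-server, and the configuration file path below may differ.

To accept remote connections, edit /etc/mysql/mysql.conf.d/mysqld.cnf, set bind-address to 0.0.0.0, and restart the service:
    sudo systemctl restart mysql
This makes the server listen on all network interfaces, so skip it when connecting only from localhost, as in the example that follows.

Install the Python drivers and read a table into pandas:
    pip install pymysql sqlalchemy
    import pandas as pd
    from sqlalchemy import create_engine
    engine = create_engine('mysql+pymysql://user:password@localhost/dbname')
    df = pd.read_sql_query("SELECT * FROM table_name", engine)

5. Common problems and optimization tips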
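Chunked reading combined with usecols, as recommended in this section, might look like the following. This is a minimal sketch: the mean_by_chunks helper, the Age and Name columns, and the chunk sizes are illustrative assumptions, and an in-memory string stands in for a large CSV file.

```python
import io

import pandas as pd


def mean_by_chunks(source, column, chunksize=1000):
    """Compute a column mean without loading the whole file into memory."""
    total, count = 0.0, 0
    # usecols limits parsing to one column; chunksize yields DataFrames piecewise.
    for chunk in pd.read_csv(source, usecols=[column], chunksize=chunksize):
        values = chunk[column].dropna()
        total += values.sum()
        count += len(values)
    return total / count if count else float("nan")


if __name__ == "__main__":
    # Hypothetical sample data with one missing Age value.
    csv_text = "Age,Name\n20,a\n30,b\n,c\n40,d\n"
    print(mean_by_chunks(io.StringIO(csv_text), "Age", chunksize=2))
```

Only running totals are kept between chunks, so peak memory depends on chunksize rather than on the file size.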
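Large files: use chunked reading (pandas.read_csv(chunksize=...)) and load only the required columns (usecols) to reduce memory usage.
Reproducibility: pin dependencies with requirements.txt or environment.yml; keep data and code under version control, and record experiment parameters and random seeds.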
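Recording experiment parameters and seeds can be as simple as writing them to a JSON file next to the results. The following is a minimal sketch using only the standard library; the function names, file name, and parameter keys are hypothetical. It seeds Python's random module only, so NumPy and scikit-learn still need their own seeds (for example, the random_state argument).

```python
import json
import random
import tempfile
from pathlib import Path


def save_run_config(path, params, seed):
    """Seed the random module and store the seed together with the parameters."""
    random.seed(seed)
    record = {"seed": seed, "params": params}
    Path(path).write_text(json.dumps(record, indent=2, sort_keys=True))
    return record


def load_run_config(path):
    """Read a stored run record back as a dictionary."""
    return json.loads(Path(path).read_text())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        config_path = Path(tmp) / "run_config.json"
        # Hypothetical parameters for a single experiment.
        save_run_config(config_path, {"test_size": 0.2, "n_estimators": 100}, seed=42)
        print(load_run_config(config_path))
```

Committing the JSON file alongside the code ties each result to the exact settings that produced it.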