30 Practical Pandas Tips for More Efficient Data Processing

ithorizon, 4 months ago (10-19)

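1. Set the `chunksize` parameter to read large CSV files in chunks with `pd.read_csv()`

Memory can become the limiting factor when you process a large CSV file. With `chunksize` set, `pd.read_csv()` returns an iterator that yields one DataFrame of at most `chunksize` rows at a time, so only one chunk has to be in memory at once.


import pandas as pd

# With chunksize set, read_csv returns an iterator of DataFrames, not a single DataFrame
chunks = pd.read_csv('large_file.csv', chunksize=10000)
for chunk in chunks:
    # Process each chunk here
    pass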
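
As a rough sketch of what chunked processing can look like in practice, the example below keeps a running total across chunks. The in-memory CSV text and the column names `id` and `amount` are made up for illustration, so the snippet runs without a file on disk.

```python
import io

import pandas as pd

# Hypothetical data: 10 rows built in memory instead of read from disk
csv_text = "id,amount\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

total = 0
rows = 0
# chunksize=4 yields chunks of 4, 4 and 2 rows
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += int(chunk["amount"].sum())
    rows += len(chunk)

print(rows, total)
```

Only aggregates that can be combined across chunks (sums, counts, minima, maxima) work this directly; something like a median needs a different approach.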

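2. Use the `usecols` parameter to read only the columns you need

When a CSV file has many columns but you only care about a few, `usecols` lets the parser skip the rest, which speeds up reading and reduces memory use.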


df = pd.read_csv('file.csv', usecols=['column1', 'column2'])

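3. Use the `dtype` parameter to specify column data types

Specifying column data types when reading a CSV file can reduce memory consumption and may also speed up processing, because narrower types take less space and pandas does not have to infer them.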


df = pd.read_csv('file.csv', dtype={'column1': 'float32', 'column2': 'int32'})
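
To make the memory saving concrete, the following sketch reads the same data twice and compares the bytes used. The CSV content is generated in memory purely for illustration; the column names match the example above.

```python
import io

import pandas as pd

# Hypothetical data: 1000 rows with one float column and one integer column
csv_text = "column1,column2\n" + "\n".join(f"{i}.5,{i}" for i in range(1000))

# Default inference picks float64 and int64
df_default = pd.read_csv(io.StringIO(csv_text))
# Explicit 32-bit types halve the per-value storage
df_small = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"column1": "float32", "column2": "int32"},
)

default_bytes = int(df_default.memory_usage(index=False).sum())
small_bytes = int(df_small.memory_usage(index=False).sum())
print(default_bytes, small_bytes)
```

Narrower types also narrow the value range and precision, so this is only safe when the data is known to fit.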

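4. Use `pd.read_excel()` to read Excel files

Pandas also provides a function for reading Excel files, much like the one for CSV files. It relies on an optional engine package, such as `openpyxl` for `.xlsx` files.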


df = pd.read_excel('file.xlsx')

5. Use `pd.concat()` to combine multiple DataFrames

When you need to combine several DataFrames into one, `pd.concat()` stacks them, by default row-wise.


df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
df = pd.concat([df1, df2])
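
One detail worth seeing in code: `pd.concat()` keeps each input's original index, so the result above has duplicate index labels. A minimal sketch of the difference `ignore_index=True` makes, reusing the same two DataFrames:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})

stacked = pd.concat([df1, df2])                         # index: 0, 1, 0, 1
renumbered = pd.concat([df1, df2], ignore_index=True)   # index: 0, 1, 2, 3

print(list(stacked.index), list(renumbered.index))
```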

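6. Use `pd.merge()` to join two DataFrames

Join the rows of two DataFrames on one or more keys. The default is an inner join, so only keys present in both DataFrames are kept, and overlapping non-key column names get the suffixes `_x` and `_y`.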


df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'D', 'D'], 'value': [4, 5, 6]})
df = pd.merge(df1, df2, on='key')
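
The sketch below contrasts the default inner join with an outer join on the same data. The suffix names `_left` and `_right` are arbitrary choices for illustration; `indicator=True` adds a `_merge` column that records where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["A", "B", "C"], "value": [1, 2, 3]})
df2 = pd.DataFrame({"key": ["B", "D", "D"], "value": [4, 5, 6]})

# Inner join (default): only the key 'B' appears in both inputs
inner = pd.merge(df1, df2, on="key")

# Outer join: keep every key from either side and label each row's origin
outer = pd.merge(
    df1,
    df2,
    on="key",
    how="outer",
    suffixes=("_left", "_right"),
    indicator=True,
)

print(inner)
print(outer)
```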

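7. Use `DataFrame.groupby()` to group a DataFrame

Split the rows of a DataFrame into groups, then apply a function to each group. `groupby()` itself only returns a GroupBy object; nothing is computed until you call an aggregation such as `sum()` on it.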


df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': [1, 3, 2, 5, 3, 6, 3, 8]})
df_grouped = df.groupby(['A', 'B'])
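
Since the grouped object on its own produces no result, here is a minimal sketch of the aggregation step on the same data: a per-group sum, and several statistics at once via `agg()`.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": [1, 3, 2, 5, 3, 6, 3, 8],
})

# Sum of C for each value of A
totals = df.groupby("A")["C"].sum()

# Several statistics per (A, B) pair; the result has a two-level index
stats = df.groupby(["A", "B"])["C"].agg(["sum", "mean"])

print(totals)
print(stats)
```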

8. Use `pd.crosstab()` to build a cross-tabulation

Compute a cross-tabulation (a frequency table) of two or more variables.


df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar'],
                   'B': ['one', 'one', 'one', 'two', 'two', 'one', 'one', 'two', 'two'],
                   'C': ['small', 'large', 'large', 'small','small', 'large', 'small', 'small','large']})
ct = pd.crosstab(df['A'], df['B'])

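9. Use `pd.cut()` to convert a continuous variable into a categorical one

Split a continuous variable into intervals and convert it into a categorical variable. Intervals are closed on the right by default, so the bins below are (0, 20], (20, 40], (40, 60] and (60, 100].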


df = pd.DataFrame({'age': [22, 55, 62, 45, 21, 22, 34, 42]})
df['age_category'] = pd.cut(df['age'], bins=[0, 20, 40, 60, 100])
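
Interval notation is not always what you want in a report. As a small sketch, `labels` can give each bin a readable name; the label strings here are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"age": [22, 55, 62, 45, 21, 22, 34, 42]})

# One label per bin; there must be exactly len(bins) - 1 labels
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 20, 40, 60, 100],
    labels=["0-20", "21-40", "41-60", "61-100"],
)

print(df)
```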

10. Use `Series.value_counts()` to count occurrences of each unique value

Count how many times each unique value occurs in a column of a DataFrame.


df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': [1, 3, 2, 5, 3, 6, 3, 8]})
df['A'].value_counts()
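
A short sketch of the result, along with the `normalize=True` option that returns proportions instead of raw counts:

```python
import pandas as pd

s = pd.Series(["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"])

counts = s.value_counts()                 # sorted by count, largest first
shares = s.value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(shares)
```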

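11. Use `pd.isnull()` or `pd.isna()` to detect missing values

Detect the missing values in a DataFrame. The two names are aliases, and both are also available as DataFrame methods.


import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [np.nan, 2, 3, 4]})
df.isnull()  # same as df.isna(); True marks a missing value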

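12. Use `DataFrame.dropna()` to drop rows or columns with missing values

Drop the rows (the default) or columns of a DataFrame that contain missing values. The call returns a new DataFrame and leaves the original unchanged.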


df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [np.nan, 2, 3, 4]})
df.dropna()
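
The defaults can be stricter than intended, so here is a minimal sketch of the main variations on the same data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, np.nan, 4], "B": [np.nan, 2, 3, 4]})

rows_any = df.dropna()                 # drop rows with any missing value
rows_subset = df.dropna(subset=["A"])  # only look at column A
rows_all = df.dropna(how="all")        # drop rows where every value is missing
cols_any = df.dropna(axis=1)           # drop columns with any missing value

print(list(rows_any.index), list(rows_subset.index), list(cols_any.columns))
```

Here both columns contain a missing value, so `axis=1` removes every column.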

13. Use `DataFrame.fillna()` to fill missing values

Fill the missing values in a DataFrame with a specified value.


df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [np.nan, 2, 3, 4]})
df.fillna(0)
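
A constant such as 0 is not always a sensible replacement. The sketch below shows two common alternatives on the same data: filling each column with its own mean, and carrying the last valid value forward with `ffill()`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, np.nan, 4], "B": [np.nan, 2, 3, 4]})

# df.mean() is a Series indexed by column name, so each column gets its own mean
filled_mean = df.fillna(df.mean())

# Forward fill: a leading NaN has nothing before it and stays missing
filled_ffill = df.ffill()

print(filled_mean)
print(filled_ffill)
```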

14. Use `DataFrame.sort_values()` to sort a DataFrame by value

Sort a DataFrame by the values of one or more columns, in ascending order by default.


df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [4, 3, 2, 1]})
df.sort_values(by='B')

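15. Use `DataFrame.sort_index()` to sort a DataFrame by index

Sort a DataFrame by its index.


df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [4, 3, 2, 1]})
# The default index 0..3 is already ascending, so sort descending to see an effect
df.sort_index(ascending=False)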

16. Use `pd.unique()` to get an array of unique values

Return the unique values of a DataFrame column as an array, in order of first appearance.


df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': [1, 3, 2, 5, 3, 6, 3, 8]})
pd.unique(df['A'])

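17. Use `Series.nunique()` to count the number of unique values

Count the number of distinct values in a DataFrame column. Missing values are excluded by default.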


df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': [1, 3, 2, 5, 3, 6, 3, 8]})
df['A'].nunique()

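18. Use `pd.DataFrame()` to convert a list of lists into a DataFrame

Convert a list of lists into a DataFrame. Each inner list becomes one row, and `columns` supplies the column names.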


data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])

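19. Use `pd.DataFrame()` to create a DataFrame from a dictionary

Create a DataFrame from a dictionary. Each key becomes a column name and each value becomes that column's data.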


data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data)

20. Use `pd.to_datetime()` to convert date and time strings

Convert strings into pandas datetime values (the `datetime64` dtype).


df = pd.DataFrame({'date': ['2021-01-01', '2021-01-02', '2021-01-03']})
df['date'] = pd.to_datetime(df['date'])
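
Real data often contains values that are not valid dates. As a small sketch with made-up input, `errors="coerce"` turns unparseable strings into `NaT` instead of raising, and the `.dt` accessor then exposes date components.

```python
import pandas as pd

# Hypothetical input with one invalid entry
raw = pd.Series(["2021-01-01", "2021-02-15", "not a date"])

# Unparseable values become NaT rather than raising an error
parsed = pd.to_datetime(raw, errors="coerce")

# Date parts are available through the .dt accessor
months = parsed.dt.month

print(parsed)
print(months)
```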

21. Use `pd.date_range()` to generate a datetime sequence

Generate a sequence of evenly spaced datetimes. The default frequency is one day.


pd.date_range(start='2021-01-01', periods=3)

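22. Use `pd.Grouper()` to group by time period

Group time series data into calendar periods, here by month.


df = pd.DataFrame({'date': pd.date_range(start='2021-01-01', periods=6, freq='D'),
                   'value': range(6)})
# 'M' is the month-end frequency; pandas 2.2 and later deprecate it in favor of 'ME'
df.groupby(pd.Grouper(key='date', freq='M')).sum()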
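
All six dates above fall in January, so the result has a single row. The sketch below uses made-up data that straddles a month boundary to show the grouping at work. It uses the month-start alias `'MS'`, which is an assumption chosen here because it is spelled the same across pandas versions; it labels each group with the first day of the month instead of the last.

```python
import pandas as pd

# Hypothetical data: two days in January, two in February
df = pd.DataFrame({
    "date": pd.date_range(start="2021-01-30", periods=4, freq="D"),
    "value": [1, 2, 3, 4],
})

# One row per calendar month, labeled by the month's first day
monthly = df.groupby(pd.Grouper(key="date", freq="MS")).sum()

print(monthly)
```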

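23. Use `Series.shift()` to shift a time series

Shift the values of a time series forward or backward by a number of periods while the index stays in place. The vacated positions are filled with NaN.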


df = pd.DataFrame({'date': pd.date_range(start='2021-01-01', periods=3, freq='D'),
                   'value': [1, 2, 3]})
df['shifted'] = df['value'].shift(1)

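24. Use `Series.diff()` to compute differences in a time series

Compute the difference between each element of a time series and the one before it. The first result is NaN because it has no predecessor.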


df = pd.DataFrame({'date': pd.date_range(start='2021-01-01', periods=3, freq='D'),
                   'value': [1, 2, 3]})
df['diff'] = df['value'].diff()
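
Shifting and differencing are closely related: `diff()` gives the same result as subtracting the series shifted by one period. A minimal sketch with invented values:

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [10, 12, 9]},
    index=pd.date_range("2021-01-01", periods=3, freq="D"),
)

df["prev"] = df["value"].shift(1)             # previous day's value
df["next"] = df["value"].shift(-1)            # a negative period looks ahead
df["diff"] = df["value"].diff()               # change from the previous day
df["manual_diff"] = df["value"] - df["prev"]  # same thing, built from shift

print(df)
```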

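25. Use `DataFrame.apply()` to apply a custom function to a DataFrame

Apply a custom function to each row (`axis=1`) or each column (`axis=0`, the default) of a DataFrame.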


df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df['sum'] = df.apply(lambda row: row['A'] + row['B'] + row['C'], axis=1)
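
Row-wise `apply()` calls the Python function once per row, which gets slow on large frames. When a built-in vectorized operation exists, it usually gives the same result faster. A sketch comparing the two for the sum above:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

# Row-wise apply: flexible, but one Python call per row
row_wise = df.apply(lambda row: row["A"] + row["B"] + row["C"], axis=1)

# Vectorized equivalent for this particular computation
vectorized = df[["A", "B", "C"]].sum(axis=1)

print(row_wise.tolist(), vectorized.tolist())
```

Keeping `apply()` for logic that has no vectorized counterpart is a reasonable rule of thumb.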

26. Use `Series.map()` to map values to new values

Map each value in a DataFrame column to a new value, here through a dictionary.


df = pd.DataFrame({'A': ['foo', 'bar', 'baz', 'foo', 'bar', 'baz']})
df['B'] = df['A'].map({'foo': 'FOO', 'bar': 'BAR', 'baz': 'BAZ'})
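
One behavior to be aware of when mapping through a dictionary: values with no matching key become NaN rather than being left unchanged. A small sketch, where `"qux"` and the `"UNKNOWN"` default are invented for illustration:

```python
import pandas as pd

s = pd.Series(["foo", "bar", "qux"])

# 'qux' has no entry in the dictionary, so it maps to NaN
mapped = s.map({"foo": "FOO", "bar": "BAR"})

# Supply a fallback for the unmapped values
with_default = mapped.fillna("UNKNOWN")

print(mapped.tolist(), with_default.tolist())
```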

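27. Use `DataFrame.filter()` to select rows or columns by label

Select rows or columns of a DataFrame by their labels. Despite its name, `filter()` matches on index or column names (through `items`, `like`, or `regex`), not on the values in the data.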


df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df_filtered = df.filter(items=['A', 'B'])
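
To make the distinction concrete, this sketch shows label-based selection with `filter()` next to value-based row selection, which is done with a boolean mask instead.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

# Label-based: columns whose name matches a regular expression
cols_ab = df.filter(regex="^[AB]$")

# Label-based on the row index: keep the rows labeled 0 and 2
rows_02 = df.filter(items=[0, 2], axis=0)

# Value-based: rows where column A is greater than 1 (boolean indexing)
a_gt_1 = df[df["A"] > 1]

print(cols_ab)
print(rows_02)
print(a_gt_1)
```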

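28. Use `DataFrame.select_dtypes()` to select columns by data type

Select the columns of a DataFrame whose data type matches the given types.


import numpy as np

# np.number matches every numeric dtype, so 'A' (int) and 'B' (float) are kept
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4.0, 5.0, 6.0], 'C': ['foo', 'bar', 'baz']})
df_numeric = df.select_dtypes(include=[np.number])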

29. Use `pd.get_dummies()` to create one-hot encodings

Convert categorical variables into one-hot encoded indicator columns, one per category.


df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three']})
df_dummies = pd.get_dummies(df, columns=['A', 'B'])
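
A sketch of what the encoding produces on the same data, with two options that often matter for modeling: `dtype=int` for 0/1 columns regardless of pandas version, and `drop_first=True` to drop one redundant indicator per variable.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
})

# One indicator column per category, named <column>_<category>
full = pd.get_dummies(df, columns=["A", "B"], dtype=int)

# Drop the first category of each variable to avoid redundant columns
reduced = pd.get_dummies(df, columns=["A", "B"], dtype=int, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```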

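30. Use `pd.read_feather()` and `DataFrame.to_feather()` to read and write Feather files

Feather is an efficient binary columnar format for reading and writing large datasets. Pandas needs the `pyarrow` package to use it.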


df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df.to_feather('my_data.feather')
df_read = pd.read_feather('my_data.feather')
