How to select a subset of data using lexicographical slicing in Python Pandas?

2024-04-20 09:08:06 14

Introduction

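Pandas has a dual selection capability: you can select a subset of data either by index position or by index label. In this post, I will show you how to select a subset of data using lexicographical slicing.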
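To make the two selection styles concrete, here is a minimal sketch on a small made-up DataFrame (the titles and numbers are hypothetical, not taken from the Kaggle data). `iloc` selects by position and `loc` selects by label.

```python
import pandas as pd

# Hypothetical toy data, used only to contrast the two selection styles.
df = pd.DataFrame(
    {"budget": [100, 200, 300]},
    index=pd.Index(["Avatar", "Brazil", "Casino"], name="title"),
)

# Position-based: rows at positions 0 and 1 (the stop position is excluded).
by_position = df.iloc[0:2]

# Label-based: rows from "Avatar" through "Brazil" (the stop label is included).
by_label = df.loc["Avatar":"Brazil"]

print(by_position)
print(by_label)
```

Both selections return the same two rows here, but note the difference in how the end of the slice is treated.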

Google is full of datasets. Search for a movie dataset on kaggle.com. This post uses a movie dataset provided by Kaggle.

How to do it

1. Import the movie dataset with only the columns needed for this example.

import pandas as pd
import numpy as np
movies = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv",index_col="title",
usecols=["title","budget","vote_average","vote_count"])
movies.sample(n=5)

                                budget  vote_average  vote_count
title
Small Voices                         0           6.6          61
Grown Ups 2                   80000000           5.8        1155
The Best Years of Our Lives    2100000           7.6         143
Tusk                           2800000           5.1         366
Operation Chromite                   0           5.8          29
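If you would like to try the `index_col` and `usecols` arguments without a network connection, the sketch below reads a tiny in-memory CSV instead of the URL. The rows and the extra `genre` column are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for the real movies file.
csv_text = """title,budget,genre,vote_average,vote_count
Avatar,100,Action,7.2,10
Brazil,200,Drama,6.5,20
"""

toy_movies = pd.read_csv(
    io.StringIO(csv_text),
    index_col="title",  # use the title column as the row index
    usecols=["title", "budget", "vote_average", "vote_count"],  # skip genre
)

print(toy_movies)
```

The `genre` column is dropped at read time, and `title` becomes the index rather than a regular column.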

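2. I always recommend sorting the index, especially when it is made up of strings. You will notice the difference when you work with a large dataset whose index is sorted.

What if I don't sort the index?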

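No problem, your code will just run forever. Just kidding. If the index labels are not sorted, pandas has to go through all the labels one by one to match your query. Imagine an Oxford dictionary without an index page: what would you do? With a sorted index you can jump straight to the label you want to extract, and the same is true for pandas.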
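The dictionary analogy can be sketched in plain Python. This is only an illustration of the idea, using the standard `bisect` module on a made-up list of titles. It is not a description of what pandas does internally.

```python
from bisect import bisect_left, bisect_right

# Hypothetical titles; sorted() puts them in lexicographical order.
labels = sorted(["Brazil", "Avatar", "Casino", "Alien", "Batman"])

# On sorted labels, binary search finds both slice bounds in O(log n) steps,
# even though neither "Aa" nor "Bb" is an actual label.
start = bisect_left(labels, "Aa")
stop = bisect_right(labels, "Bb")

selected = labels[start:stop]
print(selected)
```

On an unsorted list the same two bounds would have no meaning, because the matching titles would not sit next to each other.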

Let us first check whether the index is sorted.

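# Check whether the index is sorted in ascending order.
# Index.is_monotonic was deprecated in pandas 1.5 and removed in 2.0;
# is_monotonic_increasing is the equivalent property.
movies.index.is_monotonic_increasing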

False

3. Clearly, the index is not sorted. We will try to select the movies whose titles start with A. This is similar to writing

select * from movies where title like 'A%'

movies.loc["Aa":"Bb"]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, label, side, kind)
   4844             try:
-> 4845                 return self._searchsorted_monotonic(label, side)
   4846             except ValueError:

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in _searchsorted_monotonic(self, label, side)
   4805
-> 4806         raise ValueError("index must be monotonic increasing or decreasing")
   4807

ValueError: index must be monotonic increasing or decreasing

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
in
----> 1 movies.loc["Aa": "Bb"]

~\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   1766
   1767             maybe_callable = com.apply_if_callable(key, self.obj)
-> 1768             return self._getitem_axis(maybe_callable, axis=axis)
   1769
   1770     def _is_scalar_access(self, key: Tuple):

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1910         if isinstance(key, slice):
   1911             self._validate_key(key, axis)
-> 1912             return self._get_slice_axis(key, axis=axis)
   1913         elif com.is_bool_indexer(key):
   1914             return self._getbool_axis(key, axis=axis)

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_slice_axis(self, slice_obj, axis)
   1794
   1795         labels = obj._get_axis(axis)
-> 1796         indexer = labels.slice_indexer(
   1797             slice_obj.start, slice_obj.stop, slice_obj.step, kind=self.name
   1798         )

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in slice_indexer(self, start, end, step, kind)
   4711         slice(1, 3)
   4712         """
-> 4713         start_slice, end_slice = self.slice_locs(start, end, step=step, kind=kind)
   4714
   4715         # return a slice

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in slice_locs(self, start, end, step, kind)
   4924         start_slice = None
   4925         if start is not None:
-> 4926             start_slice = self.get_slice_bound(start, "left", kind)
   4927         if start_slice is None:
   4928             start_slice = 0

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, label, side, kind)
   4846             except ValueError:
   4847                 # raise the original KeyError
-> 4848                 raise err
   4849
   4850         if isinstance(slc, np.ndarray):

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, label, side, kind)
   4840         # we need to look up the label
   4841         try:
-> 4842             slc = self.get_loc(label)
   4843         except KeyError as err:
   4844             try:

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2646                 return self._engine.get_loc(key)
   2647             except KeyError:
-> 2648                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2649         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2650         if indexer.ndim > 1 or indexer.size > 1:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine._get_loc_duplicates()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine._maybe_get_bool_indexer()

KeyError: 'Aa'
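The same failure can be reproduced on a small made-up DataFrame (the titles and budgets below are hypothetical). The sketch also shows one alternative that is my own addition rather than part of the recipe: a boolean mask built with `Index.str.startswith`, which behaves like the SQL `like 'A%'` and does not need a sorted index.

```python
import pandas as pd

# Hypothetical toy data with a deliberately unsorted string index.
unsorted_toy = pd.DataFrame(
    {"budget": [5, 0, 7, 3]},
    index=pd.Index(["Brazil", "Avatar", "Casino", "Alien"], name="title"),
)

# Label slicing with bounds that are not in an unsorted index raises KeyError.
try:
    unsorted_toy.loc["Aa":"Bb"]
    slice_failed = False
except KeyError:
    slice_failed = True

# A boolean mask checks every label, so it works regardless of the order.
starts_with_a = unsorted_toy[unsorted_toy.index.str.startswith("A")]

print(slice_failed)
print(starts_with_a)
```

The mask keeps the rows in their original order and scans every label, so it trades the speed of a sorted slice for convenience.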

4. Sort the index in ascending order, then try the same command again to take advantage of the sorted index for lexicographical slicing.
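A minimal sketch of this step, assuming `DataFrame.sort_index` is used for the sort. It runs on a small made-up DataFrame (hypothetical titles and budgets); on the movie data the same two calls would be applied to `movies`.

```python
import pandas as pd

# Hypothetical toy data with an unsorted string index.
toy = pd.DataFrame(
    {"budget": [5, 0, 7, 3]},
    index=pd.Index(["Brazil", "Avatar", "Casino", "Alien"], name="title"),
)

# sort_index() sorts the row labels in ascending order by default.
toy_sorted = toy.sort_index()

# Check again whether the index is sorted.
print(toy_sorted.index.is_monotonic_increasing)
```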

True

5. Now our data is ready for lexicographical slicing. Let us select all the movies whose titles start with the letter A through the letter B.
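On the movie data this is the same `movies.loc["Aa":"Bb"]` call from step 3, now run against the sorted index. The sketch below shows the behaviour on a small made-up DataFrame (hypothetical titles and budgets) so that it can be run on its own.

```python
import pandas as pd

# Hypothetical toy data; sort_index() gives an ascending string index.
toy = pd.DataFrame(
    {"budget": [5, 0, 7, 3, 9]},
    index=pd.Index(
        ["Brazil", "Avatar", "Zulu", "Abduction", "Battleship"], name="title"
    ),
).sort_index()

# Neither "Aa" nor "Bb" is a label; on a sorted index pandas only needs
# to find where they would fall.
a_to_b = toy.loc["Aa":"Bb"]

print(a_to_b)
```

"Brazil" is left out because it sorts after "Bb", which is why the upper bound matters as much as the lower one.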

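                                      budget  vote_average  vote_count
title
Abandon                             25000000           4.6          45
Abandoned                                  0           5.8          27
Abduction                           35000000           5.6         961
Aberdeen                                   0           7.0           6
About Last Night                    12500000           6.0         210
...                                      ...           ...         ...
Battle for the Planet of the Apes    1700000           5.5         215
Battle of the Year                  20000000           5.9          88
Battle: Los Angeles                 70000000           5.5        1448
Battlefield Earth                   44000000           3.0         255
Battleship                         209000000           5.5        2114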
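The outputs that follow show the index in descending order. A minimal sketch of such a sort, assuming `sort_index(ascending=False)` and using a small made-up DataFrame (hypothetical titles and budgets):

```python
import pandas as pd

# Hypothetical toy data with an unsorted string index.
toy = pd.DataFrame(
    {"budget": [5, 0, 7, 3, 9]},
    index=pd.Index(
        ["Brazil", "Avatar", "Zulu", "Abduction", "Battleship"], name="title"
    ),
)

# ascending=False sorts the row labels in reverse lexicographical order.
toy_desc = toy.sort_index(ascending=False)

print(toy_desc.index.is_monotonic_decreasing)
print(toy_desc.head())
```

A descending index still counts as monotonic, so label slicing keeps working on it, only in the opposite direction.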

                            budget  vote_average  vote_count
title
Æon Flux                  62000000           5.4         703
xXx: State of the Union   60000000           4.7         549
xXx                       70000000           5.8        1424
eXistenZ                  15000000           6.7         475
[REC]²                     5600000           6.4         489

       budget  vote_average  vote_count
title

It is no surprise to see an empty DataFrame, because the data is sorted in reverse order. Let us reverse the letters and run it again.
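Both cases can be sketched on a small made-up DataFrame (hypothetical titles and budgets) sorted in descending order: the ascending bounds return nothing, and the reversed bounds return the expected rows.

```python
import pandas as pd

# Hypothetical toy data, sorted so that the index is descending.
toy_desc = pd.DataFrame(
    {"budget": [5, 0, 7, 3, 9]},
    index=pd.Index(
        ["Brazil", "Avatar", "Zulu", "Abduction", "Battleship"], name="title"
    ),
).sort_index(ascending=False)

# Ascending bounds on a descending index: the start falls after the stop,
# so the slice is empty.
empty = toy_desc.loc["Aa":"Bb"]

# Reversed bounds follow the direction of the index.
b_to_a = toy_desc.loc["Bb":"Aa"]

print(empty)
print(b_to_a)
```

The rows come back in the order of the index, so the result runs from the B titles down to the A titles.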

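                            budget  vote_average  vote_count
title
B-Girl                           0           5.5           7
Ayurveda: Art of Being      300000           5.5           3
Away We Go                17000000           6.7         189
Awake                      8600000           6.3         395
Avengers: Age of Ultron  280000000           7.3        6767
...                            ...           ...         ...
About Last Night          12500000           6.0         210
Aberdeen                         0           7.0           6
Abduction                 35000000           5.6         961
Abandoned                        0           5.8          27
Abandon                   25000000           4.6          45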
