如何在Python Pandas中使用字典顺序切片选择数据子集?
介绍
熊猫具有双重选择功能,可以使用索引位置或索引标签选择数据子集。在这篇文章中,我将向您展示如何“使用词典分类法选择数据的子集”。
Google充满了数据集。在kaggle.com中搜索电影数据集。这篇文章使用kaggle提供的电影数据集。
怎么做
1.导入仅包含此示例所需列的电影数据集。
import pandas as pd import numpy as np movies = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv",index_col="title", usecols=["title","budget","vote_average","vote_count"]) movies.sample(n=5)
2.我总是建议对索引进行排序,尤其是当索引由字符串组成时。如果在对索引进行排序时处理庞大的数据集,则会注意到差异。
如果我不对索引排序怎么办?
没问题,您的代码将永远运行。只是开个玩笑,如果索引标签未排序,那么大熊猫必须一一遍历所有标签以匹配您的查询。试想一下,没有索引页的牛津词典,您要做什么?索引排序后,您可以快速跳转到要提取的标签,Pandastoo就是这种情况。
让我们首先检查索引是否已排序。
# check if the index is sorted or not ? movies.index.is_monotonic False
3.显然,索引未排序。我们将尝试选择以A%开头的电影。这就像写作
select * from movies where title like'A%'
movies.loc["Aa":"Bb"]
--------------------------------------------------------------------------- ValueErrorTraceback (most recent call last) ~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, labe l, side, kind) 4844try: -> 4845return self._searchsorted_monotonic(label, side) 4846except ValueError: ~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in _searchsorted_monotonic(se lf, label, side) 4805 -> 4806raise ValueError("index must be monotonic increasing or decreasing") 4807 ValueError: index must be monotonic increasing or decreasing During handling of the above exception, another exception occurred: KeyErrorTraceback (most recent call last) in ----> 1 movies.loc["Aa": "Bb"] ~\anaconda3\lib\site-packages\pandas\core\indexing.py in getitem (self, key) 1766 1767maybe_callable = com.apply_if_callable(key, self.obj) -> 1768return self._getitem_axis(maybe_callable, axis=axis) 1769 1770def _is_scalar_access(self, key: Tuple): ~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis) 1910if isinstance(key, slice): 1911self._validate_key(key, axis) -> 1912return self._get_slice_axis(key, axis=axis) 1913elif com.is_bool_indexer(key): 1914return self._getbool_axis(key, axis=axis) ~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_slice_axis(self, slice_ob j, axis) 1794 1795labels = obj._get_axis(axis) -> 1796indexer = labels.slice_indexer( 1797slice_obj.start, slice_obj.stop, slice_obj.step, kind=self.name 1798) ~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in slice_indexer(self, start, end, step, kind) 4711slice(1, 3) 4712""" -> 4713start_slice, end_slice = self.slice_locs(start, end, step=step, kind=ki nd) 4714 4715# return a slice ~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in slice_locs(self, start, en d, step, kind) 4924start_slice = None 4925if start is not None: -> 4926start_slice = self.get_slice_bound(start, "left", kind) 4927if start_slice is None: 4928start_slice = 0 ~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, labe l, side, kind) 4846except ValueError: 4847# raise the original KeyError -> 4848raise err 4849 4850if isinstance(slc, np.ndarray): ~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, labe l, side, kind) 4840# we need to look up the label 4841try: -> 4842slc = self.get_loc(label) 4843except KeyError as err: 4844try: ~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance) 2646return self._engine.get_loc(key) 2647except KeyError: -> 2648return self._engine.get_loc(self._maybe_cast_indexer(key)) 2649indexer = self.get_indexer([key], method=method, tolerance=tolerance) 2650if indexer.ndim > 1 or indexer.size > 1: pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas\_libs\index.pyx in pandas._libs.index.IndexEngine._get_loc_duplicates() pandas\_libs\index.pyx in pandas._libs.index.IndexEngine._maybe_get_bool_indexer() KeyError: 'Aa'
4.按升序对索引进行排序,然后尝试使用相同的命令来利用按字典顺序进行排序的优势。
True
5.现在,我们的数据已准备就绪,可以进行字典切片。现在让我们选择所有以字母A到字母B开头的电影。
预算投票_平均投票_计数标题
毫无疑问地看到空的DataFrame,因为数据以相反的顺序排序。让我们反转字母并再次运行。