Getting Pandas to figure out how many rows to skip in pd.read_excel

Python

www说 2021-12-09 15:54:36

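I am trying to automate reading several hundred Excel files into a single dataframe. Luckily, the layout of the Excel files is fairly constant. They all have the same header (the casing of the header may vary), the same number of columns, and the data I want to read is always stored in the first sheet.

However, in some files a number of rows are skipped before the actual data begins. The rows before the actual data may or may not contain comments and the like. For instance, in some files the header is in row 3 and the data starts in row 4 and continues down.

I would like pandas to figure out on its own how many rows to skip. Currently I use a somewhat complicated solution: I first read the file into a dataframe and check whether the header is correct. If it is not, I search for the row containing the header and then re-read the file, now knowing how many rows to skip.

    import pandas as pd

    def find_header_row(df, my_header):
        """Find the row containing the header."""
        for idx, row in df.iterrows():
            row_header = [str(t).lower() for t in row]
            if len(set(my_header) - set(row_header)) == 0:
                return idx + 1
        raise Exception("Can't find header row!")

    my_header = ['col_1', 'col_2',..., 'col_n']
    df = pd.read_excel('my_file.xlsx')
    # Make columns lower case (case may vary)
    df.columns = [t.lower() for t in df.columns]
    # Check if the header of the dataframe matches my_header
    if len(set(my_header) - set(df.columns)) != 0:
        # If not, use my function to find the row containing the header
        n_rows_to_skip = find_header_row(df, my_header)
        # Re-read the dataframe, skipping the right number of rows
        df = pd.read_excel('my_file.xlsx', skiprows=n_rows_to_skip)

Since I know what the header row looks like, is there a way to have pandas figure out on its own where the data begins? Or can anyone think of a better solution?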


1 Answer

ibeautiful


Let us know if this works for you.

    import pandas as pd

    df = pd.read_excel("unamed1.xlsx")

       Unnamed: 0              Unnamed: 1  Unnamed: 2
    0         NaN  bad row1 badddd row111         NaN
    1       baaaa                     NaN         NaN
    2         NaN                     NaN         NaN
    3          id                    name         age
    4           1                   Roger          17
    5           2                    Rosa          23
    6           3                     Rob          31
    7           4                    Ives          15

    # Label of the first row in which every column is non-null (taken to be the header)
    first_row = (df.count(axis=1) >= df.shape[1]).idxmax()
    # Promote that row to the column names and keep only the rows below it
    df.columns = df.loc[first_row]
    df = df.loc[first_row + 1:]

    3  id   name  age
    4   1  Roger   17
    5   2   Rosa   23
    6   3    Rob   31
    7   4   Ives   15
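One caveat: the `count` test picks the first row with no empty cells, so a comment row that happens to fill every column would be mistaken for the header. Since the question says the header names are known, a variant that matches on those names may be safer. The following is only a rough sketch under a few assumptions: the file is read with `header=None` so the header is still an ordinary row, the helper names `find_header_row` and `promote_header` are made up for illustration, and a small in-memory frame stands in for the Excel file.

```python
import pandas as pd


def find_header_row(raw, expected):
    """Return the position of the first row that contains every expected name.

    Matching ignores case and surrounding whitespace.
    """
    wanted = {name.strip().lower() for name in expected}
    for pos, row in enumerate(raw.itertuples(index=False)):
        cells = {str(cell).strip().lower() for cell in row}
        if wanted <= cells:
            return pos
    raise ValueError("Can't find header row!")


def promote_header(raw, expected):
    """Use the detected header row as column names and keep the rows below it."""
    pos = find_header_row(raw, expected)
    # infer_objects() gives numeric columns a numeric dtype again
    data = raw.iloc[pos + 1:].reset_index(drop=True).infer_objects()
    data.columns = [str(cell).strip().lower() for cell in raw.iloc[pos]]
    return data


# Assumed stand-in for: raw = pd.read_excel("my_file.xlsx", header=None)
raw = pd.DataFrame(
    [
        [None, "bad row1 badddd row111", None],
        ["baaaa", None, None],
        [None, None, None],
        ["ID", "Name", "Age"],  # header with different casing
        [1, "Roger", 17],
        [2, "Rosa", 23],
    ]
)

df = promote_header(raw, ["id", "name", "age"])
print(df)
```

This reads each file only once, at the cost of loading the junk rows too. If re-reading is acceptable, the position returned by `find_header_row` could instead be passed as `skiprows` to a second `pd.read_excel` call, as in the question.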

Answered 2021-12-09
