I want to search my entire directory for files that match any of a list of regular expressions. This includes directories, PDF files, and CSV files. I can do this successfully when searching only text files, but searching all file types is proving difficult. Here is my work so far:

import glob
import re
import PyPDF2

#-------------------------------------------------Input----------------------------------------------------------------------------------------------
folder_path = "/home/"
file_pattern = "/*"
folder_contents = glob.glob(folder_path + file_pattern)

#Search for Emails
regex1= re.compile(r'\S+@\S+')
#Search for Phone Numbers
regex2 = re.compile(r'\d\d\d[-]\d\d\d[-]\d\d\d\d')
#Search for Locations
regex3 =re.compile("([A-Z]\w+), ([A-Z]{2})")

for file in folder_contents:
    read_file = open(file, 'rt').read()
    if readile_file == pdf:
        pdfFileObj = open('pdf.pdf', 'rb')
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        pageObj = pdfReader.getPage(0)
        content= pageObj.extractText())
    if regex1.findall(read_file) or regex2.findall(read_file) or regex3.findall(read_file):
        print ("YES, This file containts PHI")
        print(file)
    else:
        print("No, This file DOES NOT contain PHI")
        print(file)

When I run it, I get this error:

YES, This file containts PHI
/home/e136320/sample.txt
No, This file DOES NOT contain PHI
/home/e136320/medicalSample.txt
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-129-be0b68229c20> in <module>()
     19
     20 for file in folder_contents:
---> 21     read_file = open(file, 'rt').read()
     22     if readile_file == pdf:
     23         # creating a pdf file object

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 10: invalid continuation byte

Any suggestions?
1 Answer
UYOU
Contributed 1878 experience points, received more than 4 likes
You can't open a PDF file like that: open(file, 'rt') expects a plain-text file, and a PDF is binary. You could use something like this:
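import os
import PyPDF2

def read_pdf(path):
    # PdfFileReader (legacy PyPDF2 name, as in the question) needs a binary
    # file object and is not a context manager, so open the file ourselves.
    with open(path, 'rb') as f:
        return ''.join(page.extractText() for page in PyPDF2.PdfFileReader(f).pages)

def read_text(path):
    # errors='replace' keeps stray non-UTF-8 bytes from raising UnicodeDecodeError
    with open(path, 'rt', errors='replace') as f:
        return f.read()

fn, ext = os.path.splitext(file)
read_function = read_pdf if ext.lower() == '.pdf' else read_text  # else: plain text
content = read_function(file)
# Do something with content...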
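This snippet checks the file extension and then chooses a function for opening the file based on what it finds. That is somewhat naive, and it could be done better with an approach similar to the one shown in this answer.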
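The answer referred to above is not reproduced here, so the following is only a sketch of one way to go beyond the extension check, assuming that "better" means looking at the file's first bytes (PDF files start with %PDF-). The helper names is_pdf, read_pdf, read_text, contains_phi and scan are made up for illustration. The regexes are copied from the question, and the PDF reader uses the same legacy PyPDF2 calls as the question. Newer PyPDF2 releases renamed these (PdfReader, extract_text), so adjust to whatever version is installed.

```python
import os
import re

# Patterns copied from the question.
PATTERNS = [
    re.compile(r'\S+@\S+'),                     # emails
    re.compile(r'\d\d\d[-]\d\d\d[-]\d\d\d\d'),  # phone numbers
    re.compile(r'([A-Z]\w+), ([A-Z]{2})'),      # locations
]


def is_pdf(path):
    """Sniff the header instead of trusting the extension."""
    with open(path, 'rb') as f:
        return f.read(5) == b'%PDF-'


def read_pdf(path):
    """Extract text from every page (legacy PyPDF2 API, as in the question)."""
    import PyPDF2  # imported lazily so text-only scans work without it
    with open(path, 'rb') as f:
        reader = PyPDF2.PdfFileReader(f)
        return '\n'.join(reader.getPage(i).extractText()
                         for i in range(reader.getNumPages()))


def read_text(path):
    """Read text or CSV; undecodable bytes are replaced instead of raising."""
    with open(path, 'rt', encoding='utf-8', errors='replace') as f:
        return f.read()


def contains_phi(path):
    """Return True if any of the patterns matches the file's text."""
    content = read_pdf(path) if is_pdf(path) else read_text(path)
    return any(p.search(content) for p in PATTERNS)


def scan(folder):
    """Hypothetical helper: map each file under folder to True/False."""
    results = {}
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            results[path] = contains_phi(path)
    return results
```

Calling scan("/home/") and printing the items would give the same YES/NO report as the loop in the question. Because os.walk descends into subdirectories and only yields files, open() is never called on a directory, which the glob-based loop would eventually do.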