Web Scraping (22): The Most Common File-Operation Modules (Worth Bookmarking) (13)

Tags:

Python

I. The os module
The os module provides many functions that let your program interact directly with the operating system.

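os.getcwd()  Return the current working directory, i.e. the directory the Python script is currently working in

os.chdir("dirname")  Change the current working directory; equivalent to cd in a shell

os.curdir  The string that denotes the current directory: '.'

os.pardir  The string that denotes the parent directory: '..'

os.makedirs('dirname1/dirname2')    Create nested directories recursively

os.removedirs('dirname1')    Remove the directory if it is empty, then move up to the parent and remove it as well if it is empty, and so on

os.mkdir('dirname')    Create a single directory; equivalent to mkdir dirname in a shell

os.rmdir('dirname')    Remove a single empty directory; raises an error if the directory is not empty; equivalent to rmdir dirname in a shell

os.listdir('dirname')    Return a list of all files and subdirectories in the given directory, including hidden files

os.remove(path)  Delete a file

os.rename("oldname","newname")  Rename a file or directory

os.stat('path/filename')  Return information about a file or directory

os.sep    The OS-specific path separator: "\\" on Windows, "/" on Linux

os.linesep    The line terminator of the current platform: "\r\n" on Windows, "\n" on Linux

os.pathsep    The string that separates entries in a search path such as PATH: ";" on Windows, ":" on Linux

os.name    A string naming the current platform: 'nt' on Windows, 'posix' on Linux

os.system("bash command")  Run a shell command; its output goes straight to the terminal

os.environ  The environment variables of the process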
 
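os.path.abspath(path)  Return the normalized absolute version of path

os.path.split(path)  Split path into a (directory, file name) tuple

os.path.dirname(path)  Return the directory part of path; this is the first element of os.path.split(path)

os.path.basename(path)  Return the final file-name component of path. If path ends with / or \, an empty string is returned; this is the second element of os.path.split(path)

os.path.exists(path)  Return True if path exists, False otherwise

os.path.isabs(path)  Return True if path is an absolute path

os.path.isfile(path)  Return True if path is an existing file, False otherwise

os.path.isdir(path)  Return True if path is an existing directory, False otherwise

os.path.join(path1[, path2[, ...]])  Join several path components and return the result; every component before the last absolute path is discarded

os.path.getatime(path)  Return the last access time of the file or directory that path points to

os.path.getmtime(path)  Return the last modification time of the file or directory that path points to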
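
The list above is easier to remember after seeing a few of the calls work together. Below is a minimal sketch that runs inside a temporary directory so nothing on the real disk is touched; the directory and file names (dirname1, dirname2, old.txt, new.txt) are made up for illustration.

```python
import os
import tempfile

# Work inside a throwaway directory so the real file system is left alone.
with tempfile.TemporaryDirectory() as base:
    nested = os.path.join(base, "dirname1", "dirname2")
    os.makedirs(nested)                      # create both levels at once

    old_path = os.path.join(nested, "old.txt")
    new_path = os.path.join(nested, "new.txt")
    with open(old_path, "w") as f:
        f.write("hello")

    os.rename(old_path, new_path)            # rename the file
    listing = os.listdir(nested)             # bare names, not full paths
    size = os.stat(new_path).st_size         # size in bytes

    os.remove(new_path)                      # delete the file
    os.rmdir(nested)                         # remove the now-empty dirname2
    remaining = os.listdir(base)             # dirname1 is still there

print(listing, size, remaining)
```

Note that os.rmdir only removes dirname2 here; os.removedirs would have continued upward and removed dirname1 too, since it was empty.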

1. The current path and the files under it

os.getcwd(): return the current working directory.
os.listdir(path): list everything in a directory; the result is a list

>>> import os
>>> os.getcwd()
>>> os.listdir(os.getcwd())

2. Absolute paths

os.path.abspath(path): return the absolute path of path

>>> os.path.abspath('.')
>>> os.path.abspath('..')

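3. The directory part and the file-name part of a path

os.path.split(path): split a path into (directory, file name) and return a tuple. On Windows: if the last character of the path is \, only the directory part is non-empty; if the path contains no \ at all, only the file-name part is non-empty; if the path contains a \ that is not at the end, both parts are non-empty. The returned directory part does not include the trailing \.
os.path.join(path1,path2,…): join the paths; if one of them is an absolute path, everything before it is discarded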

>>> os.path.split('D:\\pythontest\\ostest\\Hello.py')
>>> os.path.split('.')
>>> os.path.split('D:\\pythontest\\ostest\\')
>>> os.path.split('D:\\pythontest\\ostest')
>>> os.path.join('D:\\pythontest', 'ostest')
>>> os.path.join('D:\\pythontest\\ostest', 'hello.py')
>>> os.path.join('D:\\pythontest\\b', 'D:\\pythontest\\a')

os.path.dirname(path): return the directory part of path; the result does not include the trailing \

>>> os.path.dirname('D:\\pythontest\\ostest\\hello.py')
>>> os.path.dirname('.')
>>> os.path.dirname('D:\\pythontest\\ostest\\')
>>> os.path.dirname('D:\\pythontest\\ostest')

os.path.basename(path): return the file-name part of path

>>> os.path.basename('D:\\pythontest\\ostest\\hello.py')
>>> os.path.basename('.')
>>> os.path.basename('D:\\pythontest\\ostest\\')
>>> os.path.basename('D:\\pythontest\\ostest')
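
The sessions above were run on Windows, where os.path is backed by the ntpath module. The sketch below imports ntpath directly so the same Windows results can be reproduced on any operating system, and adds one posixpath call for contrast. The variable names are only for illustration.

```python
import ntpath      # the Windows flavour of os.path, importable on any OS
import posixpath   # the POSIX flavour of os.path

win_file = 'D:\\pythontest\\ostest\\Hello.py'

split_file = ntpath.split(win_file)
split_trailing = ntpath.split('D:\\pythontest\\ostest\\')
split_dot = ntpath.split('.')
joined = ntpath.join('D:\\pythontest\\b', 'D:\\pythontest\\a')

# dirname and basename are simply the two halves of split.
halves = (ntpath.dirname(win_file), ntpath.basename(win_file))

# Under POSIX rules a backslash is not a separator, so nothing is split off.
posix_split = posixpath.split(win_file)

print(split_file)
print(split_trailing)
print(split_dot)
print(joined)
print(posix_split)
```

This is also why a script that hard-codes \ in its paths behaves differently on Linux; building paths with os.path.join avoids the problem.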

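4. File timestamps

os.path.getmtime(path): last modification time of a file or directory, in seconds since the epoch
os.path.getatime(path): last access time of a file or directory, in seconds since the epoch
os.path.getctime(path): creation time of a file or directory on Windows (on Unix, the time of the last metadata change), in seconds since the epoch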

>>> os.path.getmtime('D:\\pythontest\\ostest\\hello.py')
>>> os.path.getatime('D:\\pythontest\\ostest\\hello.py')
>>> os.path.getctime('D:\\pythontest\\ostest\\hello.py')
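
The raw return value is a float such as 1510000000.0, which is hard to read. A small sketch of turning it into a readable date follows; it creates its own temporary file, so the path is not the hello.py used above.

```python
import datetime
import os
import tempfile
import time

# Create a small file to inspect; delete=False lets us reopen it by name.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"print('hello')\n")
    tmp_path = tmp.name

try:
    mtime = os.path.getmtime(tmp_path)                     # float, seconds since the epoch
    modified_at = datetime.datetime.fromtimestamp(mtime)   # datetime in local time
    age_seconds = time.time() - mtime                      # how long ago it was modified
    print(modified_at.strftime("%Y-%m-%d %H:%M:%S"))
finally:
    os.remove(tmp_path)
```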

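5. File size

os.path.getsize(path): size of a file in bytes. For a directory the value is platform dependent (typically 0 on Windows) and is not the total size of its contents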

>>> os.path.getsize('D:\\pythontest\\ostest\\hello.py')
>>> os.path.getsize('D:\\pythontest\\ostest')
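
Because getsize on a directory does not add up what is inside it, the total has to be computed by hand. One possible sketch combines it with os.walk (covered in section 9); tree_size is a hypothetical helper name, not something the os module provides.

```python
import os
import tempfile

def tree_size(root):
    """Sum the sizes, in bytes, of all files below root."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total

# Build a tiny tree with known sizes: 10 bytes + 32 bytes.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    with open(os.path.join(root, "a.bin"), "wb") as f:
        f.write(b"x" * 10)
    with open(os.path.join(root, "sub", "b.bin"), "wb") as f:
        f.write(b"y" * 32)
    total_bytes = tree_size(root)

print(total_bytes)
```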

6. Checking whether a file exists

os.path.exists(path): whether the file or directory exists; returns True or False

>>> os.listdir(os.getcwd())
>>> os.path.exists('D:\\pythontest\\ostest\\hello.py')
>>> os.path.exists('D:\\pythontest\\ostest\\Hello.py')
>>> os.path.exists('D:\\pythontest\\ostest\\Hello1.py')

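7. Platform-specific constants

os defines a set of constants that describe how files and paths are represented on the current operating system (the values below are from Windows), for example: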

>>> os.sep
'\\'
>>> os.extsep
'.'
>>> os.pathsep
';'
>>> os.linesep
'\r\n'

8. Reading files often means iterating over a directory; here is one way to do it

import os

txt_rootdir = 'txt_result'

# os.listdir returns bare names, so join each one with the directory
for filename in os.listdir(txt_rootdir):
    file_path = os.path.join(txt_rootdir, filename)
    if os.path.isfile(file_path):
        print("ok running")
        # delete the file
        # os.remove(file_path)

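9. os.listdir versus os.walk for collecting file paths

Reference: https://www.cnblogs.com/jiaxin359/p/7324077.html
Case 1: the directory contains only files and no subdirectories; os.listdir is enough
Case 2: the directory contains both subdirectories and files and must be traversed recursively; use os.walk

9.1 Case 1 in detail
On the desktop there is a directory named file that contains three files: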

import os
 
path = r'C:\Users\Administrator\Desktop\file'
for filename in os.listdir(path):
    print(os.path.join(path,filename))

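os.listdir reads every entry in the directory, and os.path.join then combines the directory path with each file name, which gives the absolute path of each file (absolute because path itself is absolute). The result: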

C:\Users\Administrator\Desktop\file\test1.txt
C:\Users\Administrator\Desktop\file\test2.txt
C:\Users\Administrator\Desktop\file\test3.txt

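9.2 Case 2 in detail
First create a directory named file on the desktop that contains two subdirectories, file1 and file2, plus some files, then run the following code: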

import os
 
path = r'C:\Users\Administrator\Desktop\file'
for dirpath,dirnames,filenames in os.walk(path):
    print(dirpath,dirnames,filenames)

The output:

C:\Users\Administrator\Desktop\file ['file1', 'file2'] ['file_test1.txt', 'file_test2 .txt']
C:\Users\Administrator\Desktop\file\file1 [] ['file1_test1.txt', 'file1_test2.txt']
C:\Users\Administrator\Desktop\file\file2 [] ['file2_test1.txt']

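os.walk takes a path and yields (it is a generator) a 3-tuple dirpath, dirnames, filenames for each directory it visits.

dirpath is the path of the directory, as a string, for example C:\Users\Administrator\Desktop\file and C:\Users\Administrator\Desktop\file\file1 above.

dirnames lists the names of the directories directly under dirpath. For example, C:\Users\Administrator\Desktop\file contains two directories, file1 and file2, so those two names are listed.

filenames lists the names of the files directly under dirpath. Likewise, C:\Users\Administrator\Desktop\file contains two files, file_test1.txt and file_test2 .txt, so those two names are listed.

To get the path of every file under a directory: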

import os
path = r'C:\Users\Administrator\Desktop\file'
for dirpath,dirnames,filenames in os.walk(path):
    for filename in filenames:
        print(os.path.join(dirpath,filename))

The result:

C:\Users\Administrator\Desktop\file\file_test1.txt
C:\Users\Administrator\Desktop\file\file_test2 .txt
C:\Users\Administrator\Desktop\file\file1\file1_test1.txt
C:\Users\Administrator\Desktop\file\file1\file1_test2.txt
C:\Users\Administrator\Desktop\file\file2\file2_test1.txt
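
A common next step is to keep only some of the files, for example by extension. The sketch below is one way to do that with os.walk; find_files is a hypothetical helper, and the tree is built in a temporary directory rather than on the desktop.

```python
import os
import tempfile

def find_files(root, suffix):
    """Return sorted paths, relative to root, of files whose name ends with suffix."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.endswith(suffix):
                full = os.path.join(dirpath, filename)
                matches.append(os.path.relpath(full, root))
    return sorted(matches)

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "file1"))
    # Two .txt files at different depths plus one file that should be skipped.
    for rel in ("file_test1.txt", "notes.md", os.path.join("file1", "file1_test1.txt")):
        with open(os.path.join(root, rel), "w") as f:
            f.write("")
    txt_files = find_files(root, ".txt")

print(txt_files)
```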

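II. The sys module

sys.argv           List of command-line arguments; the first element is the path of the program itself
sys.exit(n)        Exit the program; use exit(0) for a normal exit
sys.version        Version information of the Python interpreter
sys.maxsize        Largest value of a native index-sized integer (sys.maxint existed only in Python 2; Python 3 ints are unbounded)
sys.path           Module search path; initialized from the PYTHONPATH environment variable, among other sources
sys.platform       Name of the operating-system platform
sys.stdout.write('please:')  # standard output, leads into the progress-bar example; it works in Python 3 too, where print is the usual alternative
val = sys.stdin.readline()[:-1] # standard input
sys.getrecursionlimit() # get the maximum recursion depth
sys.setrecursionlimit(1200) # set the maximum recursion depth
sys.getdefaultencoding()  # get the interpreter's default string encoding
sys.getfilesystemencoding()  # get the encoding used to convert file names between str and bytes

1. sys.path: the module search path, initialized from the PYTHONPATH environment variable among other sources
sys.path.append("path/to/custom/modules")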

>>> import os
>>> import sys
>>> sys.path
['', 'D:\\python3\\python36.zip', 'D:\\python3\\DLLs', 'D:\\python3\\lib', 'D:\\python3', 'D:\\python3\\lib\\site-packages', 'D:\\python3\\lib\\site-packages\\win32', 'D:\\python3\\lib\\site-packages\\win32\\lib', 'D:\\python3\\lib\\site-packages\\Pythonwin']

2. A progress bar:

#========= background ==========
# what the progress bar looks like
# [#             ]
# [##            ]
# [###           ]
# [####          ]

# fixed width
print('[%-15s]' %'#')
print('[%-15s]' %'##')
print('[%-15s]' %'###')
print('[%-15s]' %'####')

# printing a literal %
print('%s%%' %(100)) # %% is an escaped, literal percent sign

# the width can be passed in as a parameter
print('[%%-%ds]' %50) #[%-50s]
print(('[%%-%ds]' %50) %'#')
print(('[%%-%ds]' %50) %'##')
print(('[%%-%ds]' %50) %'###')
 
#========= the progress-bar function ==========
import sys
import time
 
def progress(percent,width=50):
    if percent >= 1:
        percent=1
    show_str=('[%%-%ds]' %width) %(int(width*percent)*'#')
    print('\r%s %d%%' %(show_str,int(100*percent)),file=sys.stdout,flush=True,end='')
 
 
#========= usage ==========
data_size=1025
recv_size=0
while recv_size < data_size:
    time.sleep(0.1) # simulate transfer latency
    recv_size+=1024 # receive 1024 bytes each time

    percent=recv_size/data_size # fraction received so far
    progress(percent,width=70) # progress bar 70 characters wide
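
The nested %-formatting above works, but the two-step '[%%-%ds]' trick is easy to get wrong. As an alternative sketch (not the original author's version), str.format can take the width as a nested field. render_bar is a hypothetical name, and it returns the text instead of printing it so that it can be checked separately.

```python
def render_bar(percent, width=50):
    """Return the progress-bar text for a fraction between 0 and 1."""
    percent = min(max(percent, 0.0), 1.0)     # clamp out-of-range values
    filled = '#' * int(width * percent)
    # '{:<{w}}' left-aligns and pads to w characters, like '%-50s' with a variable width.
    return '[{:<{w}}] {}%'.format(filled, int(100 * percent), w=width)

half = render_bar(0.5, width=10)
full = render_bar(3.2, width=10)   # values above 1 are clamped to 100%
print(half)
print(full)
```

To animate it the same way as above, print('\r' + render_bar(p), end='', flush=True) can be called inside the receive loop.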

3. Practice

import sys
 
sys.platform
'win32'
 
sys.stdout
<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
 
sys.argv
['D:\\pycharm专业版\\PyCharm 2017.1.4\\helpers\\pydev\\pydevconsole.py', '49701', '49702']
 
sys.version
'3.6.3 (v3.6.3:2c5fed8, Oct  3 2017, 18:11:49) [MSC v.1900 64 bit (AMD64)]'
 
sys.stdin.readline()
'the line is write'
"'the line is write'\n"

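III. The shutil module
A high-level module for working with files, directories and archives
shutil.copyfileobj(fsrc, fdst[, length])
Copy the contents of one file object into another

import shutil
with open('old.xml', 'r') as fsrc, open('new.xml', 'w') as fdst:
    shutil.copyfileobj(fsrc, fdst)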

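shutil.copyfile(src, dst)
Copy the contents of a file

shutil.copyfile('f1.log', 'f2.log') # the destination file does not need to exist

shutil.copymode(src, dst)
Copy only the permission bits; contents, group and owner are left unchanged

shutil.copymode('f1.log', 'f2.log') # the destination file must exist

shutil.copystat(src, dst)
Copy only the status information: mode bits, atime, mtime, flags

shutil.copystat('f1.log', 'f2.log') # the destination file must exist

shutil.copy(src, dst)
Copy the file and its permission bits

import shutil
shutil.copy('f1.log', 'f2.log')

shutil.copy2(src, dst)
Copy the file and its status information

import shutil
shutil.copy2('f1.log', 'f2.log')

shutil.ignore_patterns(*patterns)
shutil.copytree(src, dst, symlinks=False, ignore=None)
Recursively copy a directory tree

import shutil
shutil.copytree('folder1', 'folder2', ignore=shutil.ignore_patterns('*.pyc', 'tmp*'))
# The destination directory must not already exist, and you need write permission on the parent of folder2; ignore specifies what to exclude

shutil.rmtree(path[, ignore_errors[, onerror]])
Recursively delete a directory and everything in it

import shutil
shutil.rmtree('folder1')

shutil.move(src, dst)
Recursively move a file or directory, similar to the mv command; on the same file system this is just a rename, otherwise the source is copied and then removed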

import shutil
shutil.move('folder1', 'folder3')
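
The three directory-level calls (copytree, move, rmtree) are easiest to understand when run in sequence. Here is a minimal sketch in a temporary directory; the file names app.py, app.pyc and tmp_cache are invented to give ignore_patterns something to filter.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as base:
    src = os.path.join(base, "folder1")
    os.makedirs(os.path.join(src, "tmp_cache"))
    for name in ("app.py", "app.pyc"):
        with open(os.path.join(src, name), "w") as f:
            f.write("data")

    dst = os.path.join(base, "folder2")
    # folder2 must not exist yet; *.pyc files and tmp* entries are skipped.
    shutil.copytree(src, dst, ignore=shutil.ignore_patterns("*.pyc", "tmp*"))
    copied = sorted(os.listdir(dst))

    moved = os.path.join(base, "folder3")
    shutil.move(dst, moved)                 # folder2 becomes folder3
    shutil.rmtree(src)                      # delete folder1 and its contents
    left = sorted(os.listdir(base))

print(copied, left)
```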

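shutil.make_archive(base_name, format,…)
Create an archive (for example zip or tar) and return its path

base_name: the name of the archive, without the format-specific extension; it may also include a path. With a bare name the archive is saved in the current directory, otherwise in the given location,

e.g. data_bak => saved in the current directory
e.g. /tmp/data_bak => saved in /tmp/

format: the archive type: "zip", "tar", "bztar", "gztar"
root_dir: the directory to archive (defaults to the current directory)
owner: owner, defaults to the current user
group: group, defaults to the current group
logger: used for logging, usually a logging.Logger object

# archive the files under /data and put the archive in the program's current directory
import shutil
ret = shutil.make_archive("data_bak", 'gztar', root_dir='/data')

# archive the files under /data and put the archive in /tmp/
import shutil
ret = shutil.make_archive("/tmp/data_bak", 'gztar', root_dir='/data')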
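
The examples above need a real /data directory. The following sketch does the same round trip in a temporary directory and also restores the archive with shutil.unpack_archive, which the text above does not mention but which is the standard counterpart of make_archive. It uses the zip format and made-up file names.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as base:
    data_dir = os.path.join(base, "data")
    os.makedirs(data_dir)
    with open(os.path.join(data_dir, "a.log"), "w") as f:
        f.write("log line\n")

    # base_name has no extension; make_archive appends ".zip" itself.
    archive_path = shutil.make_archive(os.path.join(base, "data_bak"), "zip",
                                       root_dir=data_dir)
    archive_name = os.path.basename(archive_path)

    restore_dir = os.path.join(base, "restored")
    shutil.unpack_archive(archive_path, restore_dir)   # format inferred from the extension
    restored = os.listdir(restore_dir)

print(archive_name, restored)
```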

shutil handles archives by calling the zipfile and tarfile modules. In detail:
Compressing and extracting with zipfile

import zipfile

# compress
z = zipfile.ZipFile('laxi.zip', 'w')
z.write('a.log')
z.write('data.data')
z.close()

# extract
z = zipfile.ZipFile('laxi.zip', 'r')
z.extractall(path='.')
z.close()
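
ZipFile objects are also context managers, so the explicit close() calls can be dropped. A small sketch of that style follows, which additionally reads a member back without extracting it to disk. File names are reused from above, but everything lives in a temporary directory.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as base:
    log_path = os.path.join(base, "a.log")
    with open(log_path, "w") as f:
        f.write("hello zip")

    zip_path = os.path.join(base, "laxi.zip")
    # arcname controls the name stored inside the archive.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as z:
        z.write(log_path, arcname="a.log")

    with zipfile.ZipFile(zip_path, "r") as z:
        members = z.namelist()                      # names stored in the archive
        content = z.read("a.log").decode("utf-8")   # read a member in memory

print(members, content)
```

Without the compression argument, ZipFile stores files uncompressed (ZIP_STORED), which is what the original example does.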

Compressing and extracting with tarfile

>>> import tarfile

# compress
>>> t=tarfile.open('/tmp/egon.tar','w')
>>> t.add('/test1/a.py',arcname='a.bak')
>>> t.add('/test1/b.py',arcname='b.bak')
>>> t.close()

# extract
>>> t=tarfile.open('/tmp/egon.tar','r')
>>> t.extractall('/egon')
>>> t.close()
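
Mode 'w' above creates an uncompressed tar. The sketch below shows the gzip-compressed variant ('w:gz') with a context manager, and lists and reads members instead of extracting them; the paths are temporary stand-ins for /test1 and /tmp.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as base:
    src = os.path.join(base, "a.py")
    with open(src, "wb") as f:
        f.write(b"print('hi')\n")

    tar_path = os.path.join(base, "egon.tar.gz")
    with tarfile.open(tar_path, "w:gz") as t:      # "w:gz" adds gzip compression
        t.add(src, arcname="a.bak")

    with tarfile.open(tar_path, "r:gz") as t:
        names = t.getnames()
        # extractfile returns a file object for one member; nothing is written to disk.
        text = t.extractfile("a.bak").read().decode("utf-8")

print(names, text)
```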

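IV. The subprocess module
We often need to run a system command or script from Python. A shell command runs outside your Python process: every command you execute starts a new process. The oldest way to run a system command or script from Python, already available in Python 2, is os.system

>>> os.system('uname -a')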
Darwin Alexs-MacBook-Pro.local 15.6.0 Darwin Kernel Version 15.6.0: Sun Jun  4 21:43:07 PDT 2017; root:xnu-3248.70.3~1/RELEASE_X86_64 x86_64
0

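The call() family

# run a command and return its exit status: 0 or non-zero
>>> retcode = subprocess.call(["ls", "-l"])

# run a command; return normally if the exit status is 0, otherwise raise an exception
>>> subprocess.check_call(["ls", "-l"])
0

# take a command as a string and return a tuple: the first element is the exit status, the second is the command's output
>>> subprocess.getstatusoutput('ls /bin/ls')
(0, '/bin/ls')

# take a command as a string and return its output
>>> subprocess.getoutput('ls /bin/ls')
'/bin/ls'

# run a command and return its output; note that the output is returned, not printed; below it is assigned to res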
>>> res=subprocess.check_output(['ls','-l'])
>>> res
b'total 0\ndrwxr-xr-x 12 alex staff 408 Nov 2 11:05 OldBoyCRM\n'
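
In Python 3.5 and later, subprocess.run covers most of what the functions above do in a single call. The sketch below uses sys.executable (the running Python interpreter) as the command so that it works on systems without ls; the child commands are invented for the demonstration.

```python
import subprocess
import sys

# sys.executable is the running interpreter, so no external tools are needed.
cmd = [sys.executable, "-c", "print('hello from child')"]

# Capture stdout/stderr as text; check=True raises if the exit status is non-zero.
result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                        universal_newlines=True, check=True)
status = result.returncode
output = result.stdout.strip()

# A failing command: check=True turns exit status 3 into CalledProcessError.
try:
    subprocess.run([sys.executable, "-c", "import sys; sys.exit(3)"], check=True)
    failed_with = None
except subprocess.CalledProcessError as exc:
    failed_with = exc.returncode

print(status, output, failed_with)
```

Roughly speaking, check=True plays the role of check_call, and capturing stdout plays the role of check_output.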

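The Popen() class

args: the command to run, either a string or a sequence (e.g. a list or tuple)


stdin, stdout, stderr: the standard input, output and error handles of the child process

preexec_fn: Unix only; a callable object that
will be called in the child process just before the command is executed

shell: if True, the command is run through the shell

cwd: sets the current directory of the child process

env: the environment variables of the child process. If env = None, the child inherits the parent's environment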

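What is the difference between running these two statements?

a=subprocess.run('sleep 10',shell=True,stdout=subprocess.PIPE)
a=subprocess.Popen('sleep 10',shell=True,stdout=subprocess.PIPE)

The difference is that run waits for the command to finish, while Popen returns immediately after starting the command without waiting for its result. What is the benefit?

If the command or script you call takes 10 minutes, your main program does not have to block for 10 minutes. It can carry on with other work and check every so often (for example with poll()) whether the command has finished.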
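
The "check every so often" idea can be sketched with poll(), which returns None while the child is still running. To keep the example portable, the slow command here is a child Python process that sleeps for half a second instead of sleep 10.

```python
import subprocess
import sys
import time

# A stand-in for a slow command: a child Python process that sleeps briefly.
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(0.5)"])

polls = 0
while proc.poll() is None:     # None means the child is still running
    polls += 1                 # the main program could do real work here
    time.sleep(0.1)

exit_status = proc.returncode  # set once poll() has seen the child exit
print(polls, exit_status)
```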

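Popen returns an object from which you can get the command's result or status. The object has the following methods and attributes

poll()
    Check if child process has terminated. Returns returncode (None while the process is still running)

wait()
    Wait for child process to terminate. Returns returncode attribute.

terminate()
    Terminate the process with SIGTERM

kill()
    Kill the process with SIGKILL

communicate()
    Interact with the process: send data to stdin, read the output from stdout and stderr, then wait for the process to finish

send_signal(signal.xxx)
    Send a signal to the process

pid
    The process ID of the child process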

>>> a = subprocess.Popen('python3 guess_age.py',stdout=subprocess.PIPE,stderr=subprocess.PIPE,stdin=subprocess.PIPE,shell=True)
 
>>> a.communicate(b'22')
 
(b'your guess:try bigger\n', b'')
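
guess_age.py is not shown in this article, so the session above cannot be run as is. The sketch below substitutes a tiny inline child program (purely hypothetical, passed with -c) that reads a number and answers, which is enough to see what communicate() sends and returns.

```python
import subprocess
import sys

# Hypothetical stand-in for guess_age.py: read a number from stdin and reply.
child_code = (
    "guess = int(input())\n"
    "print('try bigger' if guess < 30 else 'try smaller or correct')\n"
)

proc = subprocess.Popen([sys.executable, "-c", child_code],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                        stderr=subprocess.PIPE)

# communicate() writes the bytes to stdin, closes it, reads both pipes and waits.
out, err = proc.communicate(b"22\n", timeout=10)
reply = out.decode().strip()
print(reply, proc.returncode)
```

Because the pipes were opened without a text mode, both the input and the returned output are bytes, as in the original session.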

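V. The configparser module
This module is used to create and modify common configuration files. It was called ConfigParser in Python 2 and was renamed configparser in Python 3.x.

configparser handles files in a specific format; under the hood it simply works on the file with open.

Here is the configuration-file format used by many programs

The format is similar to Windows ini files: a file contains one or more sections, and each section can hold several parameters (key = value)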

[DEFAULT]
ServerAliveInterval = 45
Compression = yes
CompressionLevel = 9
ForwardX11 = yes
 
[bitbucket.org]
User = hg
 
[topsecret.server.com]
Port = 50022
ForwardX11 = no

How do you generate a file like this?

import configparser

config = configparser.ConfigParser()
config["DEFAULT"] = {'ServerAliveInterval': '45',
                     'Compression': 'yes',
                     'CompressionLevel': '9'}

config['bitbucket.org'] = {}
config['bitbucket.org']['User'] = 'hg'
config['topsecret.server.com'] = {}
topsecret = config['topsecret.server.com']
topsecret['Port'] = '50022'     # mutates the parser
topsecret['ForwardX11'] = 'no'  # same here
config['DEFAULT']['ForwardX11'] = 'yes'
with open('example.ini', 'w') as configfile:
    config.write(configfile)

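ConfigParser methods

1. config = configparser.ConfigParser()
Create a ConfigParser instance

2. config.sections()
Return the list of section names in the configuration (DEFAULT is not included)

3. config.options(section)
Return the list of all keys in a section

4. config.get(section,option)
Return the value of the key option in section

5. config.add_section(str)
Add a section named str to the configuration

6. config.set(section,option,val)
Set the key option in section to the value val

7. config.read(filename)
Read a configuration file

8. config.write(obj_file)
Write the configuration to a file object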

Practice:

>>> import configparser
>>> config = configparser.ConfigParser()
>>> config.sections()
[]
>>> config.read('example.ini')
['example.ini']
>>> config.sections()
['bitbucket.org', 'topsecret.server.com']
>>> 'bitbucket.org' in config
True
>>> 'bytebong.com' in config
False
>>> config['bitbucket.org']['User']
'hg'
>>> config['DEFAULT']['Compression']
'yes'
>>> topsecret = config['topsecret.server.com']
>>> topsecret['ForwardX11']
'no'
>>> topsecret['Port']
'50022'
>>> for key in config['bitbucket.org']: print(key)
...
user
serveraliveinterval
compression
compressionlevel
forwardx11
>>> config['bitbucket.org']['ForwardX11']
'yes'
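
Every value read this way is a string ('50022', 'no'). configparser also provides typed getters, which the session above does not use. A short sketch follows, parsing from a string with read_string so that it does not depend on example.ini being on disk.

```python
import configparser

ini_text = """
[DEFAULT]
ServerAliveInterval = 45
Compression = yes

[topsecret.server.com]
Port = 50022
ForwardX11 = no
"""

config = configparser.ConfigParser()
config.read_string(ini_text)            # same parsing as read(), but from a string

section = config['topsecret.server.com']
port = section.getint('Port')                       # '50022' -> 50022
forward = section.getboolean('ForwardX11')          # 'no' -> False
interval = section.getint('ServerAliveInterval')    # inherited from DEFAULT
missing = section.get('User', fallback='nobody')    # key not defined anywhere

print(port, forward, interval, missing)
```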

Other syntax for adding, deleting, updating and querying

import configparser

config = configparser.ConfigParser()

#--------------------------------------------- query
print(config.sections())   #[]
 
config.read('example.ini')
 
print(config.sections())   #['bitbucket.org', 'topsecret.server.com']
 
print('bytebong.com' in config)# False
 
print(config['bitbucket.org']['User']) # hg
 
print(config['DEFAULT']['Compression']) #yes
 
print(config['topsecret.server.com']['ForwardX11'])  #no
 
 
for key in config['bitbucket.org']:
    print(key)
 
 
# user
# serveraliveinterval
# compression
# compressionlevel
# forwardx11
 
 
print(config.options('bitbucket.org'))#['user', 'serveraliveinterval', 'compression', 'compressionlevel', 'forwardx11']
print(config.items('bitbucket.org'))  #[('serveraliveinterval', '45'), ('compression', 'yes'), ('compressionlevel', '9'), ('forwardx11', 'yes'), ('user', 'hg')]
 
print(config.get('bitbucket.org','compression'))#yes
 
 
#--------------------------------------------- delete, update, add (then save with config.write)


config.add_section('yuan')

config.remove_section('topsecret.server.com')
config.remove_option('bitbucket.org','user')

config.set('bitbucket.org','k1','11111')

with open('i.cfg', "w") as f:
    config.write(f)
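
To see what these modifications actually produce without creating i.cfg, the result can be written to an in-memory buffer instead. This is a sketch with a reduced two-section input; io.StringIO stands in for the file object.

```python
import configparser
import io

config = configparser.ConfigParser()
config.read_string("[bitbucket.org]\nuser = hg\n\n[topsecret.server.com]\nport = 50022\n")

config.add_section('yuan')                        # add an empty section
config.remove_section('topsecret.server.com')     # delete a whole section
config.remove_option('bitbucket.org', 'user')     # delete one key
config.set('bitbucket.org', 'k1', '11111')        # add or update a key

buffer = io.StringIO()
config.write(buffer)              # write() accepts any text file-like object
written = buffer.getvalue()
print(written)
```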
