
Split a dictionary into a list of dictionaries based on overlapping values

慕容森 2022-09-06 16:58:32
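I have a dictionary of chromosomal coordinates, as in the following example:

    First_dict = {"Key1": ['chr10', 19010495, 19014590, 19014064],
                  "Key2": ['chr10', 19010495, 19014658],
                  "Key3": ['chr10', 19010502, 19014641],
                  "Key4": ['chr10', 37375766, 37377526],
                  "Key5": ['chr10', 76310389, 76315990, 76312224, 76312963],
                  "Key6": ['chr11', 14806147, 14814006]}

I want to create a list of dictionaries in which keys whose coordinate ranges (from the minimum to the maximum of the dictionary values) overlap by at least 1000 are grouped into one new dictionary, while every other key becomes a separate dictionary in the new list. So ideally, like this:

    New_list = [{"Key1": ['chr10', 19010495, 19014590, 19014064],
                 "Key2": ['chr10', 19010495, 19014658],
                 "Key3": ['chr10', 19010502, 19014641]},
                {"Key4": ['chr10', 37375766, 37377526]},
                {"Key5": ['chr10', 76310389, 76315990, 76312224, 76312963]},
                {"Key6": ['chr11', 14806147, 14814006]}]

Here Key1, Key2 and Key3 are grouped together as one dictionary in New_list because their chromosomal coordinates overlap, while Key4, Key5 and Key6 are single dictionaries in New_list because they do not overlap at all.

My initial idea was to split First_dict into a list of dictionaries using [{k: v} for (k, v) in First_dict.items()], then loop over the dictionaries, compare the minimum and maximum of each with the previous one to check for overlap, and build a new list. I ran into several issues with this and could not get it to work.

I also looked at other questions about grouping dictionaries, such as "Grouping Python dictionary keys as a list and create a new dictionary with this list as a value". My problem is that my values are not always exactly identical, as the example above shows. I also have to take the chromosome into account when checking for overlap.

Can anyone help, or suggest something to try? Thanks.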

1 Answer

MYYA

Contributed 1868 experience points, received more than 4 likes

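This problem is probably better suited to a graph-based solution, because nothing prevents multiple ranges from overlapping one another at different intervals.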
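To make the graph idea concrete, here is a minimal sketch, not the code from this answer: treat each key as a node, connect two keys when they are on the same chromosome and their min-to-max spans share at least 1000 positions, and return the connected components. The names `span`, `shared_length` and `group_by_overlap` are made up for illustration, and the reading of "overlap by at least 1000" as shared span length is an assumption taken from the question.

```python
from collections import defaultdict
from pprint import pprint


def span(values):
    """Return (chromosome, min coordinate, max coordinate) for one entry."""
    chrom, *coords = values
    return chrom, min(coords), max(coords)


def shared_length(a, b):
    """Length shared by two (chrom, lo, hi) spans; 0 if none or different chromosomes."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def group_by_overlap(d, min_overlap=1000):
    """Group keys into connected components of the 'overlaps enough' graph."""
    keys = list(d)
    spans = {k: span(d[k]) for k in keys}
    parent = {k: k for k in keys}  # union-find forest, one tree per component

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    # Pairwise comparison: O(n^2), fine for a sketch, slow for large inputs.
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            if shared_length(spans[a], spans[b]) >= min_overlap:
                parent[find(b)] = find(a)

    components = defaultdict(dict)
    for k in keys:
        components[find(k)][k] = d[k]
    return list(components.values())


first_dict = {
    "Key1": ['chr10', 19010495, 19014590, 19014064],
    "Key2": ['chr10', 19010495, 19014658],
    "Key3": ['chr10', 19010502, 19014641],
    "Key4": ['chr10', 37375766, 37377526],
    "Key5": ['chr10', 76310389, 76315990, 76312224, 76312963],
    "Key6": ['chr11', 14806147, 14814006],
}
pprint(group_by_overlap(first_dict))
```

On the sample data this yields the four groups the question asks for. The code that follows in this answer takes a different, pipeline-style route (map, sort, group, reduce).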


#!/usr/bin/env python3

from pprint import pprint


def mapper(d, overlap=1000):
    """Emit one record per chromosomal coordinate.

    Each coordinate is paired with the range (+/- overlap) within
    which another coordinate counts as a match:

    Range within which any     Original dictionary     Coordinate
    other value will match     key and chromosome      from the list
    -------------------------  ----------------------  -------------
    ((el-overlap, el+overlap), (dict-key, chromosome), el)
    """
    for key, ch in d.items():
        for el in ch[1:]:
            yield ((el-overlap, el+overlap), (key, ch[0]), el)


def sorted_mapper(d, overlap=1000):
    """Yield the mapper records sorted by their range (first element).

    Note: the chromosome label is not part of the sort key.
    """
    yield from sorted(mapper(d, overlap), key=lambda x: x[0])


def groups(iter_):
    """Split sorted records into lists of records whose coordinate lies
    within the range of the first record of the list.

    Note: only coordinates are compared here, not chromosome labels.
    """
    previous = next(iter_, None)
    if previous is None:  # empty input: nothing to group
        return
    retval = [previous]
    for chrm in iter_:
        if previous[0][0] <= chrm[-1] <= previous[0][1]:
            retval.append(chrm)
        else:
            yield retval
            previous = chrm
            retval = [previous]
    yield retval


def reduce_phase1(iter_):
    """Drop the range and rebuild {key: [chromosome, coordinate, ...]} per group."""
    for l in iter_:
        retval = {}
        for (minc, maxc), (key, lbl), chrm in l:
            x = retval.get(key, [lbl])
            x.append(chrm)
            retval[key] = x
        yield retval


def update_dict(d1, d2):
    """Append the coordinates in d2 to the matching keys of d1 (in place)."""
    retval = d1
    for key, value in d2.items():
        if key in d1:
            retval[key].extend(value[1:])
    return retval


def reduce_phase2(iter_):
    """Merge partial dictionaries whose keys are a subset of an earlier one."""
    first = next(iter_, None)
    if first is None:  # empty input: no groups
        return []
    retval = [first]
    retval_keys = [set(first)]
    for d in iter_:
        keyset = set(d)
        isnew = True
        for i, e in enumerate(retval_keys):
            if keyset <= e:
                isnew = False
                retval[i] = update_dict(retval[i], d)
        if isnew:
            retval.append(d)
            retval_keys.append(keyset)
    return retval


# Keys are quoted so that the dictionary is valid Python.
First_dict = {
    "Key1": ['chr10', 19010495, 19014590, 19014064],
    "Key2": ['chr10', 19010495, 19014658],
    "Key3": ['chr10', 19010502, 19014641],
    "Key4": ['chr10', 37375766, 37377526],
    "Key5": ['chr10', 76310389, 76315990, 76312224, 76312963],
    "Key6": ['chr11', 14806147, 14814006],
}

# Expected grouping from the question, kept for reference.
New_list = [
    {
        "Key1": ['chr10', 19010495, 19014590, 19014064],
        "Key2": ['chr10', 19010495, 19014658],
        "Key3": ['chr10', 19010502, 19014641]
    },
    {"Key4": ['chr10', 37375766, 37377526]},
    {"Key5": ['chr10', 76310389, 76315990, 76312224, 76312963]},
    {"Key6": ['chr11', 14806147, 14814006]}
]

pprint(First_dict)
print('-'*40)
p1 = reduce_phase1(groups(sorted_mapper(First_dict)))
p2 = reduce_phase2(p1)
pprint(p2)


Output

{'Key1': ['chr10', 19010495, 19014590, 19014064],
 'Key2': ['chr10', 19010495, 19014658],
 'Key3': ['chr10', 19010502, 19014641],
 'Key4': ['chr10', 37375766, 37377526],
 'Key5': ['chr10', 76310389, 76315990, 76312224, 76312963],
 'Key6': ['chr11', 14806147, 14814006]}
----------------------------------------
[{'Key6': ['chr11', 14806147, 14814006]},
 {'Key1': ['chr10', 19010495, 19014064, 19014590],
  'Key2': ['chr10', 19010495, 19014658],
  'Key3': ['chr10', 19010502, 19014641]},
 {'Key4': ['chr10', 37375766, 37377526]},
 {'Key5': ['chr10', 76310389, 76312224, 76312963, 76315990]}]

TLDR;

Mapper output

The mapper emits one record for each combination of dictionary key and coordinate. Each record carries the range within which another coordinate counts as a match.

((el-1000, el+1000), (dict-key, chromosome), el)

(el-1000, el+1000) is the range within which any other coordinate can match.

(dict-key, chromosome) is the original dictionary key and chromosome that this coordinate came from.

el is one element of the chromosomal coordinates.

((19009495, 19011495), ('Key1', 'chr10'), 19010495)
((19013590, 19015590), ('Key1', 'chr10'), 19014590)
((19013064, 19015064), ('Key1', 'chr10'), 19014064)
((19009495, 19011495), ('Key2', 'chr10'), 19010495)
((19013658, 19015658), ('Key2', 'chr10'), 19014658)
((19009502, 19011502), ('Key3', 'chr10'), 19010502)
((19013641, 19015641), ('Key3', 'chr10'), 19014641)
((37374766, 37376766), ('Key4', 'chr10'), 37375766)
((37376526, 37378526), ('Key4', 'chr10'), 37377526)
((76309389, 76311389), ('Key5', 'chr10'), 76310389)
((76314990, 76316990), ('Key5', 'chr10'), 76315990)
((76311224, 76313224), ('Key5', 'chr10'), 76312224)
((76311963, 76313963), ('Key5', 'chr10'), 76312963)
((14805147, 14807147), ('Key6', 'chr11'), 14806147)
((14813006, 14815006), ('Key6', 'chr11'), 14814006)

Note: the mapper output is not sorted.


Sort

We need to sort the transformed data using (el-1000, el+1000) as the key.

This lets us check whether the next value falls within the range of an earlier one. Because the keys are in sorted order, values that lie within the specified overlap range end up next to each other and can be linked together.

((14805147, 14807147), ('Key6', 'chr11'), 14806147)
((14813006, 14815006), ('Key6', 'chr11'), 14814006)
((19009495, 19011495), ('Key1', 'chr10'), 19010495)
((19009495, 19011495), ('Key2', 'chr10'), 19010495)
((19009502, 19011502), ('Key3', 'chr10'), 19010502)
((19013064, 19015064), ('Key1', 'chr10'), 19014064)
((19013590, 19015590), ('Key1', 'chr10'), 19014590)
((19013641, 19015641), ('Key3', 'chr10'), 19014641)
((19013658, 19015658), ('Key2', 'chr10'), 19014658)
((37374766, 37376766), ('Key4', 'chr10'), 37375766)
((37376526, 37378526), ('Key4', 'chr10'), 37377526)
((76309389, 76311389), ('Key5', 'chr10'), 76310389)
((76311224, 76313224), ('Key5', 'chr10'), 76312224)
((76311963, 76313963), ('Key5', 'chr10'), 76312963)
((76314990, 76316990), ('Key5', 'chr10'), 76315990)
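The sort key here is only the coordinate range, so records from different chromosomes could end up adjacent if their coordinates happened to be close. One possible variant, sketched below under that assumption, sorts by chromosome label first and by range second. `chromosome_aware_sort` and the `KeyX` record are made up for illustration and are not part of the answer's code.

```python
def chromosome_aware_sort(records):
    """Sort mapper-style records by chromosome label, then by range.

    Each record is ((low, high), (key, chromosome), coordinate).
    Labels are compared as plain strings, so 'chr10' sorts before 'chr2'.
    """
    return sorted(records, key=lambda r: (r[1][1], r[0]))


records = [
    ((14805147, 14807147), ("Key6", "chr11"), 14806147),
    ((19009495, 19011495), ("Key1", "chr10"), 19010495),
    ((14805500, 14807500), ("KeyX", "chr10"), 14806500),  # made-up record
]
for record in chromosome_aware_sort(records):
    print(record)
```

With this ordering the made-up `KeyX` record on chr10 is no longer placed next to the `Key6` record on chr11, even though their coordinates are only a few hundred positions apart. The grouping step would still need its own chromosome check to keep such records in separate groups.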

Group the values that fall within the specified overlap range. Each resulting list contains the coordinates that lie within the overlap range of the first coordinate in that list.

[((14805147, 14807147), ('Key6', 'chr11'), 14806147)]
----------------------------------------
[((14813006, 14815006), ('Key6', 'chr11'), 14814006)]
----------------------------------------
[((19009495, 19011495), ('Key1', 'chr10'), 19010495),
 ((19009495, 19011495), ('Key2', 'chr10'), 19010495),
 ((19009502, 19011502), ('Key3', 'chr10'), 19010502)]
----------------------------------------
[((19013064, 19015064), ('Key1', 'chr10'), 19014064),
 ((19013590, 19015590), ('Key1', 'chr10'), 19014590),
 ((19013641, 19015641), ('Key3', 'chr10'), 19014641),
 ((19013658, 19015658), ('Key2', 'chr10'), 19014658)]
----------------------------------------
[((37374766, 37376766), ('Key4', 'chr10'), 37375766)]
----------------------------------------
[((37376526, 37378526), ('Key4', 'chr10'), 37377526)]
----------------------------------------
[((76309389, 76311389), ('Key5', 'chr10'), 76310389)]
----------------------------------------
[((76311224, 76313224), ('Key5', 'chr10'), 76312224),
 ((76311963, 76313963), ('Key5', 'chr10'), 76312963)]
----------------------------------------
[((76314990, 76316990), ('Key5', 'chr10'), 76315990)]
----------------------------------------
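One detail of the grouping step is worth spelling out: `groups` compares every coordinate with the range of the record that started the current group, not with the record added most recently. The sketch below shows the difference on made-up, already sorted numbers. `group_anchored` mirrors that behaviour on bare coordinates, and `group_chained` is a hypothetical alternative that compares with the previous coordinate instead.

```python
def group_anchored(coords, overlap=1000):
    """Each coordinate must lie within `overlap` of the FIRST one in its group."""
    result = []
    for c in coords:
        if result and c - result[-1][0] <= overlap:
            result[-1].append(c)
        else:
            result.append([c])
    return result


def group_chained(coords, overlap=1000):
    """Each coordinate must lie within `overlap` of the PREVIOUS one in its group."""
    result = []
    for c in coords:
        if result and c - result[-1][-1] <= overlap:
            result[-1].append(c)
        else:
            result.append([c])
    return result


coords = [100, 900, 1700, 5000]  # made-up values, sorted ascending
print(group_anchored(coords))  # [[100, 900], [1700], [5000]]
print(group_chained(coords))   # [[100, 900, 1700], [5000]]
```

Which behaviour is wanted depends on whether a long run of closely spaced coordinates should count as one group or be cut once it drifts too far from its starting point.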

Reduce - Phase 1

Clean up the data by removing the engineered feature (the range tuple), leaving one dictionary per group.

{'Key6': ['chr11', 14806147]}
----------------------------------------
{'Key6': ['chr11', 14814006]}
----------------------------------------
{'Key1': ['chr10', 19010495],
 'Key2': ['chr10', 19010495],
 'Key3': ['chr10', 19010502]}
----------------------------------------
{'Key1': ['chr10', 19014064, 19014590],
 'Key2': ['chr10', 19014658],
 'Key3': ['chr10', 19014641]}
----------------------------------------
{'Key4': ['chr10', 37375766]}
----------------------------------------
{'Key4': ['chr10', 37377526]}
----------------------------------------
{'Key5': ['chr10', 76310389]}
----------------------------------------
{'Key5': ['chr10', 76312224, 76312963]}
----------------------------------------
{'Key5': ['chr10', 76315990]}
----------------------------------------
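For a single group, this phase can be sketched a little more compactly with `dict.setdefault`. `collapse_group` is a made-up name, and the record layout ((low, high), (key, chromosome), coordinate) is the one described in the mapper section.

```python
def collapse_group(group):
    """Turn one group of mapper records into {key: [chromosome, coord, ...]}."""
    result = {}
    for _range, (key, chromosome), coord in group:
        # Start the list with the chromosome label the first time a key is seen.
        result.setdefault(key, [chromosome]).append(coord)
    return result


group = [
    ((19013064, 19015064), ("Key1", "chr10"), 19014064),
    ((19013590, 19015590), ("Key1", "chr10"), 19014590),
    ((19013641, 19015641), ("Key3", "chr10"), 19014641),
    ((19013658, 19015658), ("Key2", "chr10"), 19014658),
]
print(collapse_group(group))
```

The range tuple is unpacked into `_range` and then ignored, which is the "removal" this phase refers to.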

Reduce - Phase 2

Merge the partial dictionaries back together. When the keys of a partial dictionary match those of an earlier one, append its coordinates to the corresponding entries.

{'Key6': ['chr11', 14806147, 14814006]}
----------------------------------------
{'Key1': ['chr10', 19010495, 19014064, 19014590],
 'Key2': ['chr10', 19010495, 19014658],
 'Key3': ['chr10', 19010502, 19014641]}
----------------------------------------
{'Key4': ['chr10', 37375766, 37377526]}
----------------------------------------
{'Key5': ['chr10', 76310389, 76312224, 76312963, 76315990]}
----------------------------------------
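The final result has the same groups as the New_list from the question, but the groups and the coordinates inside each entry come out in a different order. A small order-insensitive comparison, sketched here with a made-up helper `normalize`, makes that easy to confirm.

```python
def normalize(list_of_dicts):
    """Order-insensitive form of a list of {key: [chromosome, coords...]} dicts."""
    return sorted(
        sorted((key, values[0], tuple(sorted(values[1:]))) for key, values in d.items())
        for d in list_of_dicts
    )


expected = [
    {"Key1": ['chr10', 19010495, 19014590, 19014064],
     "Key2": ['chr10', 19010495, 19014658],
     "Key3": ['chr10', 19010502, 19014641]},
    {"Key4": ['chr10', 37375766, 37377526]},
    {"Key5": ['chr10', 76310389, 76315990, 76312224, 76312963]},
    {"Key6": ['chr11', 14806147, 14814006]},
]
actual = [
    {"Key6": ['chr11', 14806147, 14814006]},
    {"Key1": ['chr10', 19010495, 19014064, 19014590],
     "Key2": ['chr10', 19010495, 19014658],
     "Key3": ['chr10', 19010502, 19014641]},
    {"Key4": ['chr10', 37375766, 37377526]},
    {"Key5": ['chr10', 76310389, 76312224, 76312963, 76315990]},
]
print(actual == expected)                        # False: ordering differs
print(normalize(actual) == normalize(expected))  # True: same groups
```

If the original coordinate order matters downstream, the grouped keys can be used to look the values up again in First_dict rather than relying on the rebuilt lists.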

