博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
布隆过滤器之Python+Redis
阅读量:5276 次
发布时间:2019-06-14

本文共 7360 字,大约阅读时间需要 24 分钟。

简单的python实现

pip install mmh3

对于安装报错,c++编译错误问题:可以安装    Microsoft Visual C++ Build Tools()

 例子转载(https://www.cnblogs.com/naive/p/5815433.html)

from bitarray import bitarray# 3rd partyimport mmh3class BloomFilter(set):    def __init__(self, size, hash_count):        super(BloomFilter, self).__init__()        self.bit_array = bitarray(size)        self.bit_array.setall(0)        self.size = size        self.hash_count = hash_count    def __len__(self):        return self.size    def __iter__(self):        return iter(self.bit_array)    def add(self, item):        for ii in range(self.hash_count):            index = mmh3.hash(item, ii) % self.size            self.bit_array[index] = 1        return self    def __contains__(self, item):        out = True        for ii in range(self.hash_count):            index = mmh3.hash(item, ii) % self.size            if self.bit_array[index] == 0:                out = False        return outdef main():    bloom = BloomFilter(10000, 10)    animals = ['dog', 'cat', 'giraffe', 'fly', 'mosquito', 'horse', 'eagle',               'bird', 'bison', 'boar', 'butterfly', 'ant', 'anaconda', 'bear',               'chicken', 'dolphin', 'donkey', 'crow', 'crocodile']    # First insertion of animals into the bloom filter    for animal in animals:        bloom.add(animal)    # Membership existence for already inserted animals    # There should not be any false negatives    for animal in animals:        if animal in bloom:            print('{} is in bloom filter as expected'.format(animal))        else:            print('Something is terribly went wrong for {}'.format(animal))            print('FALSE NEGATIVE!')    # Membership existence for not inserted animals    # There could be false positives    other_animals = ['badger', 'cow', 'pig', 'sheep', 'bee', 'wolf', 'fox',                     'whale', 'shark', 'fish', 'turkey', 'duck', 'dove',                     'deer', 'elephant', 'frog', 'falcon', 'goat', 'gorilla',                     'hawk' ]    for other_animal in other_animals:        if other_animal in bloom:            print('{} is not in the bloom, but a false positive'.format(other_animal))        else:            print('{} is not in the bloom filter as expected'.format(other_animal))if __name__ == '__main__':    main()

 

运行结果

dog is in bloom filter as expectedcat is in bloom filter as expectedgiraffe is in bloom filter as expectedfly is in bloom filter as expectedmosquito is in bloom filter as expectedhorse is in bloom filter as expectedeagle is in bloom filter as expectedbird is in bloom filter as expectedbison is in bloom filter as expectedboar is in bloom filter as expectedbutterfly is in bloom filter as expectedant is in bloom filter as expectedanaconda is in bloom filter as expectedbear is in bloom filter as expectedchicken is in bloom filter as expecteddolphin is in bloom filter as expecteddonkey is in bloom filter as expectedcrow is in bloom filter as expectedcrocodile is in bloom filter as expectedbadger is not in the bloom filter as expectedcow is not in the bloom filter as expectedpig is not in the bloom filter as expectedsheep is not in the bloom, but a false positivebee is not in the bloom filter as expectedwolf is not in the bloom filter as expectedfox is not in the bloom filter as expectedwhale is not in the bloom filter as expectedshark is not in the bloom, but a false positivefish is not in the bloom, but a false positiveturkey is not in the bloom filter as expectedduck is not in the bloom filter as expecteddove is not in the bloom误报 filter as expecteddeer is not in the bloom filter as expectedelephant is not in the bloom, but a false positivefrog is not in the bloom filter as expectedfalcon is not in the bloom filter as expectedgoat is not in the bloom filter as expectedgorilla is not in the bloom filter as expectedhawk is not in the bloom filter as expected

 

 

 从输出结果可以发现,存在不少误报样本,但是并不存在假阴性。

不同于这段布隆过滤器的实现代码,其它语言的多个实现版本并不提供哈希函数的参数。这是因为在实际应用中误报比例这个指标比哈希函数更重要,用户可以根据误报比例的需求来调整哈希函数的个数。通常来说,sizeerror_rate是布隆过滤器的真正误报比例。如果你在初始化阶段减小了error_rate,它们会调整哈希函数的数量。

误报

布隆过滤器能够拍着胸脯说某个元素“肯定不存在”,但是对于一些元素它们会说“可能存在”。针对不同的应用场景,这有可能会是一个巨大的缺陷,亦或是无关紧要的问题。如果在检索元素是否存在时不介意引入误报情况,那么你就应当考虑用布隆过滤器。

另外,如果随意地减小了误报比率,哈希函数的数量相应地就要增加,在插入和查询时的延时也会相应地增加。本节的另一个要点是,如果哈希函数是相互独立的,并且输入元素在空间中均匀的分布,那么理论上真实误报率就不会超过理论值。否则,由于哈希函数的相关性和更频繁的哈希冲突,布隆过滤器的真实误报比例会高于理论值。

在使用布隆过滤器时,需要考虑误报的潜在影响。

确定性

当你使用相同大小和数量的哈希函数时,某个元素通过布隆过滤器得到的是正反馈还是负反馈的结果是确定的。对于某个元素x,如果它现在可能存在,那五分钟之后、一小时之后、一天之后、甚至一周之后的状态都是可能存在。当我得知这一特性时有一点点惊讶。因为布隆过滤器是概率性的,那其结果显然应该存在某种随机因素,难道不是吗?确实不是。它的概率性体现在我们无法判断究竟哪些元素的状态是可能存在

换句话说,过滤器一旦做出可能存在的结论后,结论不会发生变化。

 

 

python 基于redis实现的bloomfilter(布隆过滤器),BloomFilter_imooc

BloomFilter_imooc下载

下载地址:https://github.com/liyaopinner/BloomFilter_imooc

 

 py_bloomfilter.py(布隆过滤器)源码:

import mmh3import redisimport mathimport timeclass PyBloomFilter():    #内置100个随机种子    SEEDS = [543, 460, 171, 876, 796, 607, 650, 81, 837, 545, 591, 946, 846, 521, 913, 636, 878, 735, 414, 372,             344, 324, 223, 180, 327, 891, 798, 933, 493, 293, 836, 10, 6, 544, 924, 849, 438, 41, 862, 648, 338,             465, 562, 693, 979, 52, 763, 103, 387, 374, 349, 94, 384, 680, 574, 480, 307, 580, 71, 535, 300, 53,             481, 519, 644, 219, 686, 236, 424, 326, 244, 212, 909, 202, 951, 56, 812, 901, 926, 250, 507, 739, 371,             63, 584, 154, 7, 284, 617, 332, 472, 140, 605, 262, 355, 526, 647, 923, 199, 518]    #capacity是预先估计要去重的数量    #error_rate表示错误率    #conn表示redis的连接客户端    #key表示在redis中的键的名字前缀    def __init__(self, capacity=1000000000, error_rate=0.00000001, conn=None, key='BloomFilter'):        self.m = math.ceil(capacity*math.log2(math.e)*math.log2(1/error_rate))      #需要的总bit位数        self.k = math.ceil(math.log1p(2)*self.m/capacity)                           #需要最少的hash次数        self.mem = math.ceil(self.m/8/1024/1024)                                    #需要的多少M内存        self.blocknum = math.ceil(self.mem/512)                                     #需要多少个512M的内存块,value的第一个字符必须是ascii码,所有最多有256个内存块        self.seeds = self.SEEDS[0:self.k]        self.key = key        self.N = 2**31-1        self.redis = conn        # print(self.mem)        # print(self.k)    def add(self, value):        name = self.key + "_" + str(ord(value[0])%self.blocknum)        hashs = self.get_hashs(value)        for hash in hashs:            self.redis.setbit(name, hash, 1)    def is_exist(self, value):        name = self.key + "_" + str(ord(value[0])%self.blocknum)        hashs = self.get_hashs(value)        exist = True        for hash in hashs:            exist = exist & self.redis.getbit(name, hash)        return exist    def get_hashs(self, value):        hashs = list()        for seed in self.seeds:            hash = mmh3.hash(value, seed)            if hash >= 0:                hashs.append(hash)            else:                hashs.append(self.N - hash)        return hashspool = redis.ConnectionPool(host='127.0.0.1', port=6379, db=0)conn = redis.StrictRedis(connection_pool=pool)# 使用方法# if __name__ == "__main__":#     bf = PyBloomFilter(conn=conn)           # 利用连接池连接Redis#     bf.add('www.jobbole.com')               # 向Redis默认的通道添加一个域名#     bf.add('www.luyin.org')                 # 向Redis默认的通道添加一个域名#     print(bf.is_exist('www.zhihu.com'))     # 打印此域名在通道里是否存在,存在返回1,不存在返回0#     print(bf.is_exist('www.luyin.org'))     # 打印此域名在通道里是否存在,存在返回1,不存在返回0

 

转载于:https://www.cnblogs.com/yhll/p/9842514.html

你可能感兴趣的文章
优雅地书写回调——Promise
查看>>
AX 2009 Grid控件下多选行
查看>>
PHP的配置
查看>>
Struts框架----进度1
查看>>
Round B APAC Test 2017
查看>>
MySQL 字符编码问题详细解释
查看>>
Windows 2003全面优化
查看>>
格而知之2:UIView的autoresizingMask属性探究
查看>>
我的Hook学习笔记
查看>>
寄Android开发Gradle你需要知道的知识
查看>>
整理推荐的CSS属性书写顺序
查看>>
css & input type & search icon
查看>>
C# 强制关闭当前程序进程(完全Kill掉不留痕迹)
查看>>
ssm框架之将数据库的数据导入导出为excel文件
查看>>
语音识别中的MFCC的提取原理和MATLAB实现
查看>>
0320-学习进度条
查看>>
MetaWeblog API Test
查看>>
移动、尺寸改变
查看>>
c# 文件笔记
查看>>
类和结构
查看>>