Filter sequence elements
You have a sequence of data and want to extract certain values from it, or shorten it, according to some criteria.
The simplest way to filter sequence elements is to use a list comprehension. For example:
>>> mylist = [1, 4, -5, 10, -7, 2, 3, -1]
>>> [n for n in mylist if n > 0]
[1, 4, 10, 2, 3]
>>> [n for n in mylist if n < 0]
[-5, -7, -1]
>>>
A potential drawback of a list comprehension is that a very large input produces a very large result list, which can occupy a lot of memory. If memory is a concern, you can use a generator expression to produce the filtered elements one at a time. For example:
>>> pos = (n for n in mylist if n > 0)
>>> pos
<generator object <genexpr> at 0x1006a0eb0>
>>> for x in pos:
...     print(x)
...
1
4
10
2
3
>>>
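As a minimal sketch of what lazy evaluation means in practice, the snippet below feeds a generator expression directly to sum() and then shows that the generator cannot be reused. The variable names total and leftover are only illustrative.

```python
mylist = [1, 4, -5, 10, -7, 2, 3, -1]

# A generator expression yields matching items one at a time
# instead of building the whole result list up front.
pos = (n for n in mylist if n > 0)

# It can be passed straight to any function that consumes an iterable.
total = sum(pos)
print(total)        # 20

# The generator is now exhausted, so a second pass yields nothing.
leftover = list(pos)
print(leftover)     # []
```

If the filtered values are needed more than once, either build a list instead or create a fresh generator for each pass.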
Sometimes the filtering criteria are too complex to express easily in a list comprehension or generator expression. For example, the filtering process might need to handle exceptions or other complicated details. In that case, put the filtering code into its own function and use the built-in filter() function. For example:
values = ['1', '2', '-3', '-', '4', 'N/A', '5']

def is_int(val):
    try:
        x = int(val)
        return True
    except ValueError:
        return False

ivals = list(filter(is_int, values))
print(ivals)
# Outputs ['1', '2', '-3', '4', '5']
The filter() function creates an iterator, so if you want a list of results, you have to convert it with list(), as in the example.
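A predicate function like this is not tied to filter(). As a sketch, the same helper can be called from a list comprehension, and itertools.filterfalse() can collect the values that fail the test. The names kept and rejected are only illustrative.

```python
from itertools import filterfalse

values = ['1', '2', '-3', '-', '4', 'N/A', '5']

def is_int(val):
    """Return True if val can be parsed as an integer."""
    try:
        int(val)
        return True
    except ValueError:
        return False

# Same result as list(filter(is_int, values)), written as a comprehension.
kept = [v for v in values if is_int(v)]
print(kept)         # ['1', '2', '-3', '4', '5']

# filterfalse() keeps the items for which the predicate is false.
rejected = list(filterfalse(is_int, values))
print(rejected)     # ['-', 'N/A']
```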
List comprehensions and generator expressions are usually the easiest way to filter simple data. They can also transform the data while filtering it. For example:
>>> mylist = [1, 4, -5, 10, -7, 2, 3, -1]
>>> import math
>>> [math.sqrt(n) for n in mylist if n > 0]
[1.0, 2.0, 3.1622776601683795, 1.4142135623730951, 1.7320508075688772]
>>>
A variation on filtering is to replace the values that fail the test with a new value instead of discarding them. For example, rather than only finding the positive numbers in a column of data, you may want to replace the non-positive numbers with a specified value. This is easily done by moving the filter condition into a conditional expression, as follows:
>>> clip_neg = [n if n > 0 else 0 for n in mylist]
>>> clip_neg
[1, 4, 0, 10, 0, 2, 3, 0]
>>> clip_pos = [n if n < 0 else 0 for n in mylist]
>>> clip_pos
[0, 0, -5, 0, -7, 0, 0, -1]
>>>
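The position of the condition is what separates the two forms. The short sketch below puts them side by side: a trailing if drops elements, while a conditional expression before for keeps every position. The names filtered and replaced are only illustrative.

```python
mylist = [1, 4, -5, 10, -7, 2, 3, -1]

# A trailing "if" is a filter: elements that fail the test are dropped.
filtered = [n for n in mylist if n > 0]

# "a if cond else b" before "for" is a conditional expression:
# every element is kept, and failing ones are replaced by 0.
replaced = [n if n > 0 else 0 for n in mylist]

print(filtered)     # [1, 4, 10, 2, 3]
print(replaced)     # [1, 4, 0, 10, 0, 2, 3, 0]
print(len(mylist), len(filtered), len(replaced))    # 8 5 8
```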
Another notable filtering tool is itertools.compress(), which takes an iterable and an accompanying Boolean selector sequence as input. It outputs the items of the iterable whose corresponding selector is True. This is useful when you need to filter one sequence based on the values of another, related sequence. For example, suppose you have the following two columns of data:
addresses = [
    '5412 N CLARK',
    '5148 N CLARK',
    '5800 E 58TH',
    '2122 N CLARK',
    '5645 N RAVENSWOOD',
    '1060 W ADDISON',
    '4801 N BROADWAY',
    '1039 W GRANVILLE',
]
counts = [0, 3, 10, 4, 1, 7, 6, 1]
Now suppose you want a list of all the addresses whose corresponding count is greater than 5. You can do this:
>>> from itertools import compress
>>> more5 = [n > 5 for n in counts]
>>> more5
[False, False, True, False, False, True, True, False]
>>> list(compress(addresses, more5))
['5800 E 58TH', '1060 W ADDISON', '4801 N BROADWAY']
>>>
The key is to first create a sequence of Booleans that indicates which elements satisfy the condition. The compress() function then picks out the items whose selector at the same position is True.
Like filter(), compress() returns an iterator. If you need a list, use list() to convert the result.
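As a sketch of how compress() relates to the other tools in this section, the example below builds the selectors lazily with a generator expression and compares the result with an equivalent zip()-based list comprehension. The names and scores data are hypothetical, chosen only for illustration.

```python
from itertools import compress

# Hypothetical sample data: two related columns of equal length.
names = ['ann', 'bob', 'cy', 'dee']
scores = [72, 91, 58, 88]

# Selectors can be any iterable of truthy/falsy values,
# including a generator expression that is evaluated lazily.
selectors = (s >= 80 for s in scores)
passed = list(compress(names, selectors))
print(passed)       # ['bob', 'dee']

# Equivalent comprehension that pairs the two columns with zip().
same = [name for name, s in zip(names, scores) if s >= 80]
print(same)         # ['bob', 'dee']
```

compress() tends to read best when the Boolean sequence already exists or is reused, while the zip() form keeps the condition next to the data it tests.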