Indexing and selecting data with pandas


Let's take a look at an example of indexing in pandas. The examples in this article use the "nba.csv" file.

Multiple row and multiple column selection

Take a DataFrame with some sample data and index it by selecting multiple rows and multiple columns at once.

Suppose we want to select the columns Age, College and Salary only for the rows labeled Amir Johnson and Terry Rozier.

The resulting DataFrame contains just those two rows and three columns.
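A minimal sketch of this selection with .loc[], using a small hand-made frame in place of nba.csv. Only the row labels and column names follow the article; every value below is made up for illustration.

```python
import pandas as pd

# Made-up stand-in for nba.csv (values are illustrative, not from the dataset).
data = pd.DataFrame(
    {
        "Age": [25, 29, 22],
        "College": ["College A", None, "College B"],
        "Salary": [100.0, 200.0, 300.0],
        "Team": ["Boston Celtics"] * 3,
    },
    index=pd.Index(["Avery Bradley", "Amir Johnson", "Terry Rozier"], name="Name"),
)

# Rows are chosen by label and columns by name in a single .loc[] call.
subset = data.loc[["Amir Johnson", "Terry Rozier"], ["Age", "College", "Salary"]]
print(subset)
```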

Select multiple rows and all columns

Let's say we want to select the rows Amir Johnson, Terry Rozier and John Holland with all columns in the DataFrame.

The resulting DataFrame contains those three rows and every column.
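As a rough sketch (again with made-up values standing in for nba.csv), passing only a list of row labels to .loc[] keeps every column and returns the rows in the order the labels were listed:

```python
import pandas as pd

# Made-up stand-in for nba.csv (values are illustrative only).
data = pd.DataFrame(
    {"Age": [29, 27, 22, 25], "Salary": [100.0, 200.0, 300.0, 400.0]},
    index=pd.Index(
        ["Amir Johnson", "John Holland", "Terry Rozier", "Avery Bradley"],
        name="Name",
    ),
)

# Only row labels are given, so all columns are kept.
rows = data.loc[["Amir Johnson", "Terry Rozier", "John Holland"]]
print(rows)
```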

Selecting some columns and all rows

Let's say we want to select the Age, Height and Salary columns with all the rows in the DataFrame.

The resulting DataFrame contains every row but only those three columns.
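One way to sketch this, assuming a small made-up frame instead of nba.csv, is to pass a list of column names either to the indexing operator or to .loc[] with a colon for the rows. Both forms should give the same result:

```python
import pandas as pd

# Made-up stand-in for nba.csv (values are illustrative only).
data = pd.DataFrame(
    {
        "Team": ["Boston Celtics", "Boston Celtics"],
        "Age": [25, 29],
        "Height": ["6-2", "6-9"],
        "Salary": [100.0, 200.0],
    },
    index=pd.Index(["Avery Bradley", "Amir Johnson"], name="Name"),
)

wanted = ["Age", "Height", "Salary"]
with_operator = data[wanted]      # indexing operator with a list of columns
with_loc = data.loc[:, wanted]    # ":" keeps all rows
print(with_operator.equals(with_loc))
```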

Indexing in pandas using [], .loc[], .iloc[] and .ix[]
  • DataFrame[] : the indexing operator.
  • DataFrame.loc[] : this indexer is used for labels.
  • DataFrame.iloc[] : this indexer is used for integer positions.
  • DataFrame.ix[] : this indexer accepted both labels and integer positions. It was deprecated in pandas 0.20 and removed in pandas 1.0, so new code should use .loc[] or .iloc[] instead.

Collectively they are called indexers. They are by far the most common ways to get elements, rows, and columns from a DataFrame.
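The difference between label-based and position-based selection is easiest to see when the row labels are themselves integers. The sketch below uses a made-up frame whose labels are 10, 20 and 30:

```python
import pandas as pd

# Made-up frame: the row labels (10, 20, 30) differ from the positions (0, 1, 2).
df = pd.DataFrame({"score": [1.5, 2.5, 3.5]}, index=[10, 20, 30])

by_label = df.loc[20, "score"]     # row whose label is 20
by_position = df.iloc[0, 0]        # first row, first column

label_slice = df.loc[10:20]        # label slices include both endpoints
position_slice = df.iloc[0:1]      # position slices exclude the stop, like Python lists

print(by_label, by_position, len(label_slice), len(position_slice))
```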

    Indexing a DataFrame using the [] indexing operator:
    The indexing operator refers to the square brackets following an object. Passing a single column name returns that column as a Series.

    # import the pandas package
    import pandas as pd

    # create a DataFrame from a CSV file, using the Name column as the index
    data = pd.read_csv("nba.csv", index_col="Name")

    # retrieve a single column with the indexing operator
    first = data["Age"]
    print(first)

    The result is a Series containing the Age column, indexed by Name.

    Selecting multiple columns

    To select multiple columns, pass a list of column names to the indexing operator.

    # import the pandas package
    import pandas as pd

    # create a DataFrame from a CSV file, using the Name column as the index
    data = pd.read_csv("nba.csv", index_col="Name")

    # retrieve multiple columns with the indexing operator
    first = data[["Age", "College", "Salary"]]
    print(first)

    The result is a DataFrame containing only the Age, College and Salary columns.

    Indexing a DataFrame using .loc[]:
    This indexer selects data by row and column labels. It can select subsets of rows or columns, and it can also select subsets of rows and columns at the same time.

    Selecting one row

    To select one row using .loc[], pass a single row label to it.

    # import the pandas package
    import pandas as pd

    # create a DataFrame from a CSV file, using the Name column as the index
    data = pd.read_csv("nba.csv", index_col="Name")

    # retrieve rows by label with the loc indexer
    first = data.loc["Avery Bradley"]
    second = data.loc["RJ Hunter"]
    print(first, "\n\n", second)

    Two Series are returned, because a single label was passed each time.
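To illustrate why a Series comes back here, the sketch below (made-up values, not the nba.csv data) compares a single label with a one-element list of labels. The first returns a Series, the second a one-row DataFrame:

```python
import pandas as pd

# Made-up stand-in for nba.csv (values are illustrative only).
data = pd.DataFrame(
    {"Age": [25, 22], "Team": ["Boston Celtics", "Boston Celtics"]},
    index=pd.Index(["Avery Bradley", "RJ Hunter"], name="Name"),
)

one_row = data.loc["Avery Bradley"]          # single label
one_row_frame = data.loc[["Avery Bradley"]]  # list containing one label

print(type(one_row).__name__)
print(type(one_row_frame).__name__)
```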

    Selecting multiple rows

    To select multiple rows, put all the row labels in a list and pass the list to .loc[].

    import pandas as pd

    # create a DataFrame from a CSV file, using the Name column as the index
    data = pd.read_csv("nba.csv", index_col="Name")

    # retrieve multiple rows with the loc indexer
    first = data.loc[["Avery Bradley", "RJ Hunter"]]
    print(first)

    Selecting two rows and three columns

    To select two rows and three columns, pass a list of the two row labels and a separate list of the three column labels, like this:

     DataFrame.loc[["row1", "row2"], ["column1", "column2", "column3"]]

    import pandas as pd

    # create a DataFrame from a CSV file, using the Name column as the index
    data = pd.read_csv("nba.csv", index_col="Name")

    # retrieve two rows and three columns with the loc indexer
    first = data.loc[["Avery Bradley", "RJ Hunter"],
                     ["Team", "Number", "Position"]]
    print(first)

    Selecting all rows and some columns

    To select all rows and some columns, use a single colon (:) to select all rows and a list of the columns we want, as follows:

     DataFrame.loc[:, ["column1", "column2", "column3"]]

    import pandas as pd

    # create a DataFrame from a CSV file, using the Name column as the index
    data = pd.read_csv("nba.csv", index_col="Name")

    # retrieve all rows and some columns with the loc indexer
    first = data.loc[:, ["Team", "Number", "Position"]]
    print(first)

    Indexing a DataFrame using .iloc[]:
    This indexer retrieves rows and columns by position. To use it, we specify the integer positions of the rows and of the columns we need. df.iloc is very similar to df.loc but only uses integer locations for selection.

    Selecting a single row

    To select one row using .iloc[], pass a single integer to it.

    import pandas as pd

    # create a DataFrame from a CSV file, using the Name column as the index
    data = pd.read_csv("nba.csv", index_col="Name")

    # retrieve a row by position with the iloc indexer
    row2 = data.iloc[3]
    print(row2)

    The result is a Series holding the fourth row, because positions start at 0.

    Selecting multiple rows

    To select multiple rows, pass a list of integers to .iloc[].

    import pandas as pd

    # create a DataFrame from a CSV file, using the Name column as the index
    data = pd.read_csv("nba.csv", index_col="Name")

    # retrieve multiple rows with the iloc indexer
    row2 = data.iloc[[3, 5, 7]]
    print(row2)

    Selecting two rows and two columns

    To select two rows and two columns, create a list of two integers for the rows and a list of two integers for the columns, then pass both to .iloc[].

    import pandas as pd

    # create a DataFrame from a CSV file, using the Name column as the index
    data = pd.read_csv("nba.csv", index_col="Name")

    # retrieve two rows and two columns with the iloc indexer
    row2 = data.iloc[[3, 4], [1, 2]]
    print(row2)

    Selecting all rows and some columns

    To select all rows and some columns, use a single colon (:) to select all rows and a list of integers for the columns, then pass both to .iloc[].

    import pandas as pd

    # create a DataFrame from a CSV file, using the Name column as the index
    data = pd.read_csv("nba.csv", index_col="Name")

    # retrieve all rows and some columns with the iloc indexer
    row2 = data.iloc[:, [1, 2]]
    print(row2)

    Indexing using .ix[] as .loc[]

    To select one row, pass a single row label to .ix[]. It acts like .loc[] when given a row label. Because .ix[] was removed in pandas 1.0, this example only runs on older versions; on current versions use data.loc["Avery Bradley"].

    # import the pandas package
    import pandas as pd

    # create a DataFrame from a CSV file, using the Name column as the index
    data = pd.read_csv("nba.csv", index_col="Name")

    # retrieve a row by label with the ix indexer (pandas < 1.0 only)
    first = data.ix["Avery Bradley"]
    print(first)

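    Selecting one row using .ix[] as .iloc[]

    When the index holds labels rather than integers, .ix[] acts like .iloc[] if given an integer. This example also only runs on pandas versions before 1.0; on current versions use data.iloc[1].

    # import the pandas package
    import pandas as pd

    # create a DataFrame from a CSV file, using the Name column as the index
    data = pd.read_csv("nba.csv", index_col="Name")

    # retrieve a row by position with the ix indexer (pandas < 1.0 only)
    first = data.ix[1]
    print(first)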
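Since .ix[] is no longer available from pandas 1.0 onward, code that mixed labels and positions has to be written with .loc[] or .iloc[]. The following is one possible sketch, using a made-up frame in place of nba.csv; Index.get_loc converts a label to its position, and data.columns[i] converts a position to its label.

```python
import pandas as pd

# Made-up stand-in for nba.csv (values are illustrative only).
data = pd.DataFrame(
    {"Team": ["Boston Celtics"] * 3, "Age": [25, 29, 22]},
    index=pd.Index(["Avery Bradley", "Amir Johnson", "Terry Rozier"], name="Name"),
)

by_label = data.loc["Avery Bradley"]   # replaces data.ix["Avery Bradley"]
by_position = data.iloc[1]             # replaces data.ix[1]

# Row by label, column by position: translate one side, then use one indexer.
row_position = data.index.get_loc("Terry Rozier")
mixed_via_iloc = data.iloc[row_position, 1]
mixed_via_loc = data.loc["Terry Rozier", data.columns[1]]

print(by_position.name, mixed_via_iloc, mixed_via_loc)
```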

    Indexing Methods in DataFrame

    Function              Description
    DataFrame.head()      Return the top n rows of a DataFrame.
    DataFrame.tail()      Return the bottom n rows of a DataFrame.
    DataFrame.at[]        Access a single value for a row/column label pair.
    DataFrame.iat[]       Access a single value for a row/column pair by integer position.
    DataFrame.iloc[]      Purely integer-location based indexing for selection by position.
    DataFrame.lookup()    Label-based "fancy indexing" function for DataFrame (deprecated in pandas 1.2 and removed in 2.0).
    DataFrame.pop()       Return item and drop from frame.
    DataFrame.xs()        Return a cross-section (row(s) or column(s)) from the DataFrame.
    DataFrame.get()       Get item from object for given key (for example, a DataFrame column).
    DataFrame.isin()      Return a boolean DataFrame showing whether each element in the DataFrame is contained in values.
    DataFrame.where()     Return an object of the same shape as self whose entries are from self where cond is True and otherwise are from other.
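A few of these methods are shown together in the sketch below. The frame, its labels (p1 to p3) and its values are invented for the example.

```python
import pandas as pd

# Invented frame for illustration.
df = pd.DataFrame(
    {"Age": [25, 27, 29], "Team": ["A", "B", "A"]},
    index=["p1", "p2", "p3"],
)

single_by_label = df.at["p2", "Age"]       # one value by row/column label
single_by_position = df.iat[0, 1]          # one value by integer position
top_two = df.head(2)                       # first two rows
last_one = df.tail(1)                      # last row
row_p3 = df.xs("p3")                       # cross-section: the row labelled p3
missing = df.get("Height")                 # None, because there is no such column
on_team_a = df["Team"].isin(["A"])         # boolean mask
older = df["Age"].where(df["Age"] > 26)    # NaN where the condition is False

print(single_by_label, single_by_position, missing)
```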

    How do I merge two dictionaries in a single expression (taking union of dictionaries)?

    By Carl Meyer

    I have two Python dictionaries, and I want to write a single expression that returns these two dictionaries, merged (i.e. taking the union). The update() method would be what I need, if it returned its result instead of modifying a dictionary in-place.

    >>> x = {"a": 1, "b": 2}
    >>> y = {"b": 10, "c": 11}
    >>> z = x.update(y)
    >>> print(z)
    None
    >>> x
    {"a": 1, "b": 10, "c": 11}
    

    How can I get that final merged dictionary in z, not x?

    (To be extra-clear, the last-one-wins conflict-handling of dict.update() is what I'm looking for as well.)


    Answer #1

    How can I merge two Python dictionaries in a single expression?

    For dictionaries x and y, z becomes a shallowly-merged dictionary with values from y replacing those from x.

    • In Python 3.9.0 or greater (released 5 October 2020): PEP 584 was implemented and provides the simplest method:

      z = x | y          # NOTE: 3.9+ ONLY
      
    • In Python 3.5 or greater:

      z = {**x, **y}
      
    • In Python 2, (or 3.4 or lower) write a function:

      def merge_two_dicts(x, y):
          z = x.copy()   # start with keys and values of x
          z.update(y)    # modifies z with keys and values of y
          return z
      

      and now:

      z = merge_two_dicts(x, y)
      

    Explanation

    Say you have two dictionaries and you want to merge them into a new dictionary without altering the original dictionaries:

    x = {"a": 1, "b": 2}
    y = {"b": 3, "c": 4}
    

    The desired result is to get a new dictionary (z) with the values merged, and the second dictionary's values overwriting those from the first.

    >>> z
    {"a": 1, "b": 3, "c": 4}
    

    A new syntax for this, proposed in PEP 448 and available as of Python 3.5, is

    z = {**x, **y}
    

    And it is indeed a single expression.

    Note that we can merge in with literal notation as well:

    z = {**x, "foo": 1, "bar": 2, **y}
    

    and now:

    >>> z
    {"a": 1, "b": 3, "foo": 1, "bar": 2, "c": 4}
    

    It is now showing as implemented in the release schedule for 3.5, PEP 478, and it has now made its way into the What's New in Python 3.5 document.

    However, since many organizations are still on Python 2, you may wish to do this in a backward-compatible way. The classically Pythonic way, available in Python 2 and Python 3.0-3.4, is to do this as a two-step process:

    z = x.copy()
    z.update(y) # which returns None since it mutates z
    

    In both approaches, y will come second and its values will replace x's values, thus b will point to 3 in our final result.

    Not yet on Python 3.5, but want a single expression

    If you are not yet on Python 3.5 or need to write backward-compatible code, and you want this in a single expression, the most performant correct approach is to put it in a function:

    def merge_two_dicts(x, y):
        """Given two dictionaries, merge them into a new dict as a shallow copy."""
        z = x.copy()
        z.update(y)
        return z
    

    and then you have a single expression:

    z = merge_two_dicts(x, y)
    

    You can also make a function to merge an arbitrary number of dictionaries, from zero to a very large number:

    def merge_dicts(*dict_args):
        """
        Given any number of dictionaries, shallow copy and merge into a new dict,
        precedence goes to key-value pairs in latter dictionaries.
        """
        result = {}
        for dictionary in dict_args:
            result.update(dictionary)
        return result
    

    This function will work in Python 2 and 3 for all dictionaries. For example, given dictionaries a to g:

    z = merge_dicts(a, b, c, d, e, f, g) 
    

    and key-value pairs in g will take precedence over dictionaries a to f, and so on.
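To make the precedence rule concrete, here is a small check with three made-up dictionaries. The function is repeated so the snippet runs on its own.

```python
def merge_dicts(*dict_args):
    """Shallow-merge any number of dicts; later dicts win on key conflicts."""
    result = {}
    for dictionary in dict_args:
        result.update(dictionary)
    return result

# Made-up inputs: the key "k" appears in all three.
a = {"k": 1, "only_a": True}
b = {"k": 2}
c = {"k": 3, "only_c": True}

merged = merge_dicts(a, b, c)
print(merged)          # {'k': 3, 'only_a': True, 'only_c': True}
print(merge_dicts())   # {} (zero arguments are allowed)
print(a)               # {'k': 1, 'only_a': True} (inputs are not modified)
```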

    Critiques of Other Answers

    Don't use what you see in the formerly accepted answer:

    z = dict(x.items() + y.items())
    

    In Python 2, you create two lists in memory for each dict, create a third list in memory with length equal to the length of the first two put together, and then discard all three lists to create the dict. In Python 3, this will fail because you're adding two dict_items objects together, not two lists:

    >>> c = dict(a.items() + b.items())
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: unsupported operand type(s) for +: 'dict_items' and 'dict_items'
    

    and you would have to explicitly create them as lists, e.g. z = dict(list(x.items()) + list(y.items())). This is a waste of resources and computation power.

    Similarly, taking the union of items() in Python 3 (viewitems() in Python 2.7) will also fail when values are unhashable objects (like lists, for example). Even if your values are hashable, since sets are semantically unordered, the behavior is undefined in regards to precedence. So don't do this:

    >>> c = dict(a.items() | b.items())
    

    This example demonstrates what happens when values are unhashable:

    >>> x = {"a": []}
    >>> y = {"b": []}
    >>> dict(x.items() | y.items())
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: unhashable type: 'list'
    

    Here's an example where y should have precedence, but instead the value from x is retained due to the arbitrary order of sets:

    >>> x = {"a": 2}
    >>> y = {"a": 1}
    >>> dict(x.items() | y.items())
    {"a": 2}
    

    Another hack you should not use:

    z = dict(x, **y)
    

    This uses the dict constructor and is very fast and memory-efficient (even slightly more so than our two-step process) but unless you know precisely what is happening here (that is, the second dict is being passed as keyword arguments to the dict constructor), it's difficult to read, it's not the intended usage, and so it is not Pythonic.

    Here's an example of the usage being remediated in django.

    Dictionaries are intended to take hashable keys (e.g. frozensets or tuples), but this method fails in Python 3 when keys are not strings.

    >>> c = dict(a, **b)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: keyword arguments must be strings
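For contrast, a short sketch (with made-up dictionaries) showing that the unpacking form accepts non-string keys in Python 3, while the constructor form raises TypeError:

```python
a = {"x": 1}
b = {("p", "q"): 2}   # a tuple key: hashable, but not a string

# Dict unpacking inside a literal places no restriction on key types.
merged = {**a, **b}

# Passing b as keyword arguments requires every key to be a string.
try:
    dict(a, **b)
    constructor_failed = False
except TypeError:
    constructor_failed = True

print(merged)
print(constructor_failed)
```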
    

    From the mailing list, Guido van Rossum, the creator of the language, wrote:

    I am fine with declaring dict({}, **{1:3}) illegal, since after all it is abuse of the ** mechanism.

    and

    Apparently dict(x, **y) is going around as "cool hack" for "call x.update(y) and return x". Personally, I find it more despicable than cool.

    It is my understanding (as well as the understanding of the creator of the language) that the intended usage for dict(**y) is for creating dictionaries for readability purposes, e.g.:

    dict(a=1, b=10, c=11)
    

    instead of

    {"a": 1, "b": 10, "c": 11}
    

    Response to comments

    Despite what Guido says, dict(x, **y) is in line with the dict specification, which btw. works for both Python 2 and 3. The fact that this only works for string keys is a direct consequence of how keyword parameters work and not a short-coming of dict. Nor is using the ** operator in this place an abuse of the mechanism, in fact, ** was designed precisely to pass dictionaries as keywords.

    Again, it doesn't work in Python 3 when keys are not strings. The implicit calling contract is that namespaces take ordinary dictionaries, while users must only pass keyword arguments that are strings. All other callables enforced it. dict broke this consistency in Python 2:

    >>> foo(**{("a", "b"): None})
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: foo() keywords must be strings
    >>> dict(**{("a", "b"): None})
    {("a", "b"): None}
    

    This inconsistency was bad given other implementations of Python (PyPy, Jython, IronPython). Thus it was fixed in Python 3, as this usage could be a breaking change.

    I submit to you that it is malicious incompetence to intentionally write code that only works in one version of a language or that only works given certain arbitrary constraints.

    More comments:

    dict(x.items() + y.items()) is still the most readable solution for Python 2. Readability counts.

    My response: merge_two_dicts(x, y) actually seems much clearer to me, if we're actually concerned about readability. And dict(x.items() + y.items()) is not forward compatible, as Python 2 is increasingly deprecated.

    {**x, **y} does not seem to handle nested dictionaries. the contents of nested keys are simply overwritten, not merged [...] I ended up being burnt by these answers that do not merge recursively and I was surprised no one mentioned it. In my interpretation of the word "merging" these answers describe "updating one dict with another", and not merging.

    Yes. I must refer you back to the question, which is asking for a shallow merge of two dictionaries, with the first's values being overwritten by the second's, in a single expression.
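To illustrate what "shallow" means here, the sketch below uses two made-up dictionaries that share a nested key. The inner dictionary from y replaces the one from x wholesale, and the merged result holds a reference to y's inner dictionary rather than a copy:

```python
# Made-up nested dictionaries for illustration.
x = {"settings": {"theme": "dark", "font": "mono"}, "version": 1}
y = {"settings": {"theme": "light"}}

z = {**x, **y}

print(z)                               # {'settings': {'theme': 'light'}, 'version': 1}
print(z["settings"] is y["settings"])  # True: the inner dict is shared, not copied
```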

    Assuming two dictionaries of dictionaries, one might recursively merge them in a single function, but you should be careful not to modify the dictionaries from either source, and the surest way to avoid that is to make a copy when assigning values. As keys must be hashable and are usually therefore immutable, it is pointless to copy them:

    from copy import deepcopy
    
    def dict_of_dicts_merge(x, y):
        z = {}
        overlapping_keys = x.keys() & y.keys()
        for key in overlapping_keys:
            z[key] = dict_of_dicts_merge(x[key], y[key])
        for key in x.keys() - overlapping_keys:
            z[key] = deepcopy(x[key])
        for key in y.keys() - overlapping_keys:
            z[key] = deepcopy(y[key])
        return z
    

    Usage:

    >>> x = {"a":{1:{}}, "b": {2:{}}}
    >>> y = {"b":{10:{}}, "c": {11:{}}}
    >>> dict_of_dicts_merge(x, y)
    {"b": {2: {}, 10: {}}, "a": {1: {}}, "c": {11: {}}}
    

    Coming up with contingencies for other value types is far beyond the scope of this question, so I will point you at my answer to the canonical question on a "Dictionaries of dictionaries merge".

    Less Performant But Correct Ad-hocs

    These approaches are less performant, but they will provide correct behavior. They will be much less performant than copy and update or the new unpacking because they iterate through each key-value pair at a higher level of abstraction, but they do respect the order of precedence (latter dictionaries have precedence)

    You can also chain the dictionaries manually inside a dict comprehension:

    {k: v for d in dicts for k, v in d.items()} # iteritems in Python 2.7
    

    or in Python 2.6 (and perhaps as early as 2.4 when generator expressions were introduced):

    dict((k, v) for d in dicts for k, v in d.items()) # iteritems in Python 2
    

    itertools.chain will chain the iterators over the key-value pairs in the correct order:

    from itertools import chain
    z = dict(chain(x.items(), y.items())) # iteritems in Python 2
    

    Performance Analysis

    I'm only going to do the performance analysis of the usages known to behave correctly. (Self-contained so you can copy and paste yourself.)

    from timeit import repeat
    from itertools import chain
    
    x = dict.fromkeys("abcdefg")
    y = dict.fromkeys("efghijk")
    
    def merge_two_dicts(x, y):
        z = x.copy()
        z.update(y)
        return z
    
    min(repeat(lambda: {**x, **y}))
    min(repeat(lambda: merge_two_dicts(x, y)))
    min(repeat(lambda: {k: v for d in (x, y) for k, v in d.items()}))
    min(repeat(lambda: dict(chain(x.items(), y.items()))))
    min(repeat(lambda: dict(item for d in (x, y) for item in d.items())))
    

    In Python 3.8.1, NixOS:

    >>> min(repeat(lambda: {**x, **y}))
    1.0804965235292912
    >>> min(repeat(lambda: merge_two_dicts(x, y)))
    1.636518670246005
    >>> min(repeat(lambda: {k: v for d in (x, y) for k, v in d.items()}))
    3.1779992282390594
    >>> min(repeat(lambda: dict(chain(x.items(), y.items()))))
    2.740647904574871
    >>> min(repeat(lambda: dict(item for d in (x, y) for item in d.items())))
    4.266070580109954
    
    $ uname -a
    Linux nixos 4.19.113 #1-NixOS SMP Wed Mar 25 07:06:15 UTC 2020 x86_64 GNU/Linux
    


    Answer #2

    In your case, what you can do is:

    z = dict(list(x.items()) + list(y.items()))
    

    This will, as you want it, put the final dict in z, and make the value for key b be properly overridden by the second (y) dict's value:

    >>> x = {"a":1, "b": 2}
    >>> y = {"b":10, "c": 11}
    >>> z = dict(list(x.items()) + list(y.items()))
    >>> z
    {"a": 1, "b": 10, "c": 11}
    
    

    If you use Python 2, you can even remove the list() calls. To create z:

    >>> z = dict(x.items() + y.items())
    >>> z
    {"a": 1, "c": 11, "b": 10}
    

    If you use Python version 3.9.0a4 or greater, then you can directly use:

    x = {"a":1, "b": 2}
    y = {"b":10, "c": 11}
    z = x | y
    print(z)
    
    {"a": 1, "b": 10, "c": 11}
    


    Answer #3

    An alternative:

    z = x.copy()
    z.update(y)
    
