
Interaction of Python and FugueSQL in Jupyter Notebooks


The purpose of the FugueSQL language is to provide an enhanced SQL interface for end-to-end data workflows. It allows you to load, transform, and save data as Python dataframes in Jupyter Notebooks. The SQL code is parsed and translated into code for Pandas, Spark, or Dask.

This gives SQL users access to the capabilities of Spark and Dask in the language they are most comfortable with. In addition, FugueSQL offers keywords for distributed computing, such as PREPARTITION and PERSIST.

In this article, we will look at the basic features of FugueSQL and its use with Spark or Dask.

Benefits of FugueSQL

An example of using FugueSQL with Python in Jupyter Notebooks

First, as you can see in the example above, we can use the LOAD and SAVE keywords. Second, FugueSQL uses a friendlier syntax than standard SQL. Users can also call Python functions from FugueSQL code.

SQL cells in notebooks are created with the %%fsql cell magic, which also enables syntax highlighting in Jupyter Notebooks. The same SQL can be used inside Python code through the fsql() function.
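Below is a minimal sketch of both approaches. The file paths and column names (users.csv, user_id, measurement) are made up for illustration, and the import path of fsql may differ between Fugue versions.

%%fsql
df = LOAD "data/users.csv" (header=true)
SELECT user_id, measurement FROM df WHERE measurement > 0
SAVE OVERWRITE "data/users_clean.parquet"

The same workflow can be run from a regular Python cell:

from fugue_sql import fsql

query = """
df = LOAD "data/users.csv" (header=true)
SELECT user_id, measurement FROM df WHERE measurement > 0
SAVE OVERWRITE "data/users_clean.parquet"
"""
fsql(query).run()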

Assigning variables

Dataframes can be assigned to variables, similar to temporary tables or common table expressions (CTEs) in SQL. Dataframes created in SQL cells can also be used later in Python. In the example below, two dataframes are created by transforming df, which was obtained with Pandas in a Python cell (the same df as in the first picture). A final dataframe is then produced from these two using a JOIN.

Assigning dataframes to variables
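A sketch of such a cell is shown below. The thresholds and column names are hypothetical, and df is assumed to be a Pandas dataframe already defined in a Python cell.

%%fsql
high = SELECT user_id, measurement FROM df WHERE measurement >= 10
low = SELECT user_id, measurement FROM df WHERE measurement < 10

SELECT high.user_id,
       high.measurement AS high_value,
       low.measurement AS low_value
  FROM high INNER JOIN low ON high.user_id = low.user_id
PRINT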

Jinja templates

FugueSQL can interact with Python through Jinja templating. This allows Python logic to alter SQL queries, much like parameters in SQL.

An example of using a Python variable in FugueSQL code
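For example, a variable defined in a regular Python cell can be referenced in a FugueSQL cell with double curly braces (the variable name min_value here is hypothetical):

min_value = 10   # defined in a Python cell

%%fsql
SELECT user_id, measurement FROM df WHERE measurement > {{min_value}}
PRINT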

Python functions

Thanks to FugueSQL, you can use Python functions inside blocks of SQL code. In the example below, we use the seaborn library to plot a chart based on two columns of a dataframe. The OUTPUT keyword is used to invoke the function.

Using Python in Jupyter Notebooks
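A sketch of this pattern is below; the function name and columns are illustrative, and seaborn plus matplotlib are assumed to be installed. The function is defined in a Python cell:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# an outputter: takes a dataframe, draws a plot, returns nothing
def plot_measurements(df: pd.DataFrame) -> None:
    sns.scatterplot(data=df, x="user_id", y="measurement")
    plt.show()

It is then called from a FugueSQL cell:

%%fsql
SELECT user_id, measurement FROM df
OUTPUT USING plot_measurements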

Comparison with ipython-sql

FugueSQL is designed to work with data already loaded into memory (although working with data from storage is also possible). A project called ipython-sql provides the %%sql cell magic, which is mainly used to load data into the Python environment from a database.

FugueSQL allows you to use the same SQL code in Pandas, Spark, and Dask without changing it. The focus of FugueSQL is on in-memory computation rather than loading data from the database.

Distributed computing in Spark and Dask

As the amount of data we work with continues to grow, distributed computing engines such as Spark and Dask are becoming more popular. FugueSQL allows users to run the same FugueSQL code on these engines.

In the code snippet below, we changed the magic command from %%fsql to %%fsql spark, and now the SQL code will run on the Spark engine.

An example of using FugueSQL and Spark
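For a simple query, that looks like the sketch below; apart from the engine name in the magic, nothing in the SQL changes (the columns are the same hypothetical ones used earlier):

%%fsql spark
SELECT user_id, measurement FROM df WHERE measurement > 0
PRINT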

One common operation where moving to a distributed computing environment pays off is computing the median for each group.

First, we define a Python function in Jupyter Notebooks that takes a dataframe and returns the user_id together with the median of its measurement column. This function is designed to handle one user_id at a time. Even though it is written with Pandas, it will work on Spark and Dask.

import pandas as pd

# schema: user_id:int, median:double
def get_median(df: pd.DataFrame) -> pd.DataFrame:
    # df holds the rows of a single user_id; return one row with its median
    return pd.DataFrame({'user_id': [df.iloc[0]['user_id']],
                         'median': [df['measurement'].median()]})

We can then use the PREPARTITION keyword to partition our data by user_id and apply the get_median function to each partition.

Using the PREPARTITION keyword in FugueSQL with Python in Jupyter Notebooks
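A sketch of the corresponding cell, assuming get_median is defined as above, is:

%%fsql spark
TRANSFORM df PREPARTITION BY user_id USING get_median
PRINT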

As the size of the data increases, parallelization becomes more useful. In this example, the Pandas engine took about 520 seconds for the operation, while Spark (parallelized across 4 cores) took about 70 seconds on a dataset of 320 million rows.

Another common use of Dask is spilling data that does not fit in memory to disk. This allows users to process more data before running into out-of-memory issues.

Installing FugueSQL in Jupyter Notebooks

Fugue (and FugueSQL) are available through PyPI. They can be installed using pip (Dask and Spark are installed separately).

pip install fugue

In a Jupyter Notebook, after running the setup function, you can use the %%fsql cell magic. This also enables SQL syntax highlighting.

from fugue_notebook import setup
setup()
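After that, any cell starting with the magic is treated as FugueSQL. A small smoke test (with made-up data) might look like this:

%%fsql
CREATE [[0, 10], [1, 20]] SCHEMA user_id:int, measurement:int
PRINT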
