First, install Python 3.11: https://www.python.org/downloads/release/python-3116/
Then install and update pip, and alias Python 3.11 (see below): https://pip.pypa.io/en/stable/installation/
Install Zsh Shell: https://github.com/ohmyzsh/ohmyzsh/wiki/Installing-ZSH
Edit the .zshrc file: sudo nano ~/.zshrc
Set a python alias for Python 3.11 (point it at wherever python3.11 was installed, e.g. /usr/local/bin):
alias python='python3.11'
alias pip='pip3.11'
python -m pip install --upgrade pip
Set the AIRFLOW_HOME environment variable:
export AIRFLOW_HOME=~/airflow
Install Airflow through a pip wheel (cross-platform: Linux (WSL2) / macOS):
AIRFLOW_VERSION=2.8.0
# Extract the version of Python you have installed. If you're currently using a Python version that is not supported by Airflow, you may want to set this manually.
# See above for supported versions.
PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
# For example, this would install 2.8.0 with Python 3.8: https://raw.githubusercontent.com/apache/airflow/constraints-2.8.0/constraints-3.8.txt
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
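After the install finishes, a quick sanity check (assuming the python alias above is active) confirms Airflow imports from the intended interpreter:
python -c "import airflow; print(airflow.__version__)"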
Install the Swagger Python client:
pip install swaggerpy
Edit .zshrc again: sudo nano ~/.zshrc
Add the following to the end of the .zshrc file:
export PATH=/Library/Frameworks/Python.framework/Versions/3.11/bin:$PATH
TO START AIRFLOW:
airflow scheduler
Open another terminal in VS Code and run:
airflow webserver --port 8081
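Note: on a brand-new $AIRFLOW_HOME, the scheduler and webserver expect an initialized metadata database and a login user. If either command complains, something along these lines usually gets you going (adjust the user details, and check the Airflow docs for your exact 2.8 setup):
airflow db migrate
airflow users create --username admin --firstname Ada --lastname Admin --role Admin --email admin@example.com --password admin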
Use a sample CSV of your own data to run this script; it can be adjusted to any CSV:
Point the configuration file at the DAG directory you want to use. The configuration file lives in ~/airflow and its name is airflow.cfg:
[core]
# The folder where your airflow pipelines live, most likely a
# subfolder in a code repository. This path must be absolute.
#
# Variable: AIRFLOW__CORE__DAGS_FOLDER
#
dags_folder = /Users/####/Documents/luceeapp/airflows
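If you want to confirm which folder Airflow will actually scan, you can read the value back from airflow.cfg with plain Python. A minimal sketch, assuming the default ~/airflow/airflow.cfg location:

import configparser
import os

# airflow.cfg is plain INI; disable interpolation so raw % characters elsewhere in the file don't raise errors
cfg = configparser.ConfigParser(interpolation=None)
cfg.read(os.path.expanduser("~/airflow/airflow.cfg"))
print(cfg.get("core", "dags_folder"))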
Here is a sample DAG:
import pendulum
from airflow.datasets import Dataset
from airflow.models.dag import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator # Import PythonOperator
import pandas as pd
# [START dataset_def]
dag1_dataset = Dataset("/Users/####/Documents/luceeapp/testcsv/laptops.csv", extra={"hi": "bye"})
# ...
def sort_data(**kwargs):
    # Load the dataset
    dataset_path = "/Users/####/Documents/luceeapp/testcsv/laptops.csv"
    df = pd.read_csv(dataset_path)
    # Sort the data based on a specific column, for example 'old_price'
    sorted_df = df.sort_values(by='old_price')
    # Save the sorted data back to the same file or a new one
    sorted_df.to_csv(dataset_path, index=False)
# ...
with DAG(
    dag_id="dataset_produces_1",
    catchup=False,
    start_date=pendulum.datetime(2024, 1, 9, tz="UTC"),
    schedule="@daily",
    tags=["new", "task"],
) as dag1:
    # Producer task: declares dag1_dataset (defined above) as an outlet
    bash_task = BashOperator(outlets=[dag1_dataset], task_id="New_Task", bash_command="sleep 5")

    # Add a PythonOperator for sorting the data
    sort_data_task = PythonOperator(
        task_id='sort_data_task',
        python_callable=sort_data,
    )

    # Set task dependencies
    bash_task >> sort_data_task  # Adjust the dependencies as needed
Once the file is in your dags_folder, the DAG will show up in the Airflow UI and run on the @daily schedule.
The code simply sorts the CSV by a column and writes it back to the same file; it can be adjusted for other transformations. This is just a basic example so that people have something concrete to get started with.
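Because bash_task declares dag1_dataset as an outlet, you can also hang a second DAG off that dataset instead of a cron/preset schedule. Below is a minimal consumer sketch; the dag_id and echo task are made up for illustration, and the Dataset path must match the producer's exactly:

import pendulum
from airflow.datasets import Dataset
from airflow.models.dag import DAG
from airflow.operators.bash import BashOperator

# Must point at the same URI as the producer's outlet
dag1_dataset = Dataset("/Users/####/Documents/luceeapp/testcsv/laptops.csv")

with DAG(
    dag_id="dataset_consumes_1",
    catchup=False,
    start_date=pendulum.datetime(2024, 1, 9, tz="UTC"),
    schedule=[dag1_dataset],  # runs whenever the producer updates the dataset
    tags=["new", "task"],
) as dag2:
    consumer_task = BashOperator(task_id="Consumer_Task", bash_command="echo 'laptops.csv was updated'")

A dataset-scheduled DAG like this only runs after the producing task finishes successfully, which gives you event-driven chaining without a fixed schedule.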