stage_L3/README.md
2024-07-02 03:58:06 +02:00

Description of the project

General information

The execution is managed via the Makefile.

The Python environment is managed via a standard virtual environment. To install a new Python package, add it to the requirements.txt file (using pip syntax). It should be installed automatically the next time you execute the project. In any case, you can run make requirements.txt yourself.

The installation of new databases (from csv) is managed in the Makefile.

Configuration

The configuration is stored in the src/config.yaml file.

Database-specific configuration

database_name should contain the name of the database to use. The database has to be stored in the proper directory structure (see Directory structure of the project > Datasets). This parameter is case sensitive.

Each database can have its own independent configuration, stored under a key with the same name as the database. For example, the database named SSB has its configuration under the SSB: key (and this configuration is used only when database_name is SSB).
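The lookup described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the `config` dictionary stands in for the parsed content of src/config.yaml, the helper name `database_config` is hypothetical, and the values inside each section are made up.

```python
# Hypothetical stand-in for the parsed content of src/config.yaml.
# The values inside each database section are illustrative only.
config = {
    "database_name": "SSB",
    "SSB": {"orders_length": 3},
    "flight_delay": {"orders_length": 4},
}


def database_config(config: dict) -> dict:
    """Return the section whose key equals config["database_name"]."""
    name = config["database_name"]
    if name not in config:
        # Keys are compared exactly, so "ssb" does not match "SSB".
        raise KeyError(f"no configuration section for database {name!r}")
    return config[name]


selected = database_config(config)
print(selected)
```

Because the section is looked up by exact key, a database_name that differs only in case from the section name selects nothing.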

The following table explains every parameter used in the database-specific configuration.

| key | type | usage |
| --- | --- | --- |
| orders_length | integer | The length of the orderings considered. |
| hypothesis_ordering | list[str] | The ordering whose correctness is tested. |
| parameter | str | The "parameter" attribute in the query (an attribute in the database). |
| authorized_parameter_values | list[str] | The restriction on the possible values in the query's orderings (WHERE parameter IN authorized_parameter_values). |
| summed_attribute | str | The database attribute that is summed in the aggregation and used to order the values. |
| criterion | list[str] | The list of possible values for the criterion in the query. When a random query is generated, one of these values is chosen at random. |
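To make the role of these parameters more concrete, here is a rough sketch of how an ordering could be computed and compared with hypothesis_ordering. It rests on several assumptions that the project may not share: the table name `flights`, the attribute names and values, the descending sort, and the use of orders_length as a LIMIT are all hypothetical, and the criterion key is left out because the query shape it belongs to is not described here.

```python
import sqlite3

# Hypothetical database-specific section; names and values are made up.
db_config = {
    "orders_length": 2,
    "hypothesis_ordering": ["AA", "DL"],
    "parameter": "airline",
    "authorized_parameter_values": ["AA", "DL", "UA"],
    "summed_attribute": "delay",
}


def observed_ordering(conn, table, cfg):
    """Order the authorized parameter values by the sum of summed_attribute."""
    parameter = cfg["parameter"]
    summed = cfg["summed_attribute"]
    values = cfg["authorized_parameter_values"]
    placeholders = ", ".join("?" for _ in values)
    # Identifiers cannot be bound as SQL parameters, so they are quoted inline.
    sql = (
        f'SELECT "{parameter}" FROM "{table}" '
        f'WHERE "{parameter}" IN ({placeholders}) '
        f'GROUP BY "{parameter}" '
        f'ORDER BY SUM("{summed}") DESC '  # assumed direction: largest sum first
        "LIMIT ?"  # assumed meaning of orders_length
    )
    return [row[0] for row in conn.execute(sql, [*values, cfg["orders_length"]])]


# Tiny in-memory table used only to exercise the sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (airline TEXT, delay INTEGER)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?)",
    [("AA", 10), ("AA", 30), ("DL", 25), ("UA", 5), ("WN", 100)],
)

ordering = observed_ordering(conn, "flights", db_config)
hypothesis_holds = ordering == db_config["hypothesis_ordering"]
print(ordering, hypothesis_holds)
```

In this toy data, WN has the largest total but is excluded because it is not listed in authorized_parameter_values.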

The query_generator key is a parameter containing the name of the query-generator object that is used when building the query. You should not modify this unless you modify the code accordingly.

Directory structure of the project

Virtual environment

The following folders and files are part of the Python venv directory structure: bin/, include/, lib/, share/, and pyvenv.cfg.

The requirements.txt file lists the Python packages required for the project. They should already be installed, but if you reset the venv, you can reinstall them with python3 -m pip install -r requirements.txt.

Source code

All python source code is inside the src/ directory.

Datasets

Datasets are stored inside specific directories.

Let's say you have a dataset named XLII.

  • All files related to the dataset must be inside the XLII_dataset/ folder
  • The .csv files containing the original data must be placed inside the XLII_dataset/csv/ folder
  • The file containing the SQL code to create the tables with the correct schema must be in the XLII_dataset/create_tables.sql file

Obviously, you can replace XLII with any dataset name you want (I used flight_delay and SSB).

Then, if you run make reset, an SQLite database file named XLII_dataset/XLII.db is created, or overwritten if it already exists. It is initialized with the schema given in XLII_dataset/create_tables.sql and populated with the data from the XLII_dataset/csv/*.csv files.
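For readers who want to see what such a reset amounts to, the following Python sketch reproduces the same steps with the standard library. It is not the Makefile's actual recipe. It assumes, hypothetically, that each CSV file is named after the table it fills (for example csv/items.csv for a table named items) and that its first row holds the column names. The function name `reset_database` is also made up.

```python
import csv
import sqlite3
from pathlib import Path


def reset_database(name: str, root: Path = Path(".")) -> Path:
    """Recreate <name>_dataset/<name>.db from create_tables.sql and csv/*.csv."""
    dataset_dir = root / f"{name}_dataset"
    db_path = dataset_dir / f"{name}.db"
    db_path.unlink(missing_ok=True)  # drop any previous database file

    conn = sqlite3.connect(db_path)
    try:
        # Create the tables with the schema provided by the dataset.
        conn.executescript((dataset_dir / "create_tables.sql").read_text())
        for csv_path in sorted((dataset_dir / "csv").glob("*.csv")):
            with csv_path.open(newline="") as handle:
                reader = csv.reader(handle)
                header = next(reader)  # assumed: first row lists the columns
                columns = ", ".join(f'"{col}"' for col in header)
                marks = ", ".join("?" for _ in header)
                # Assumed: the file stem is the name of the target table.
                conn.executemany(
                    f'INSERT INTO "{csv_path.stem}" ({columns}) VALUES ({marks})',
                    reader,
                )
        conn.commit()
    finally:
        conn.close()
    return db_path
```

Calling `reset_database("XLII")` from the project root would then rebuild XLII_dataset/XLII.db under those assumptions.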